In this Knight Foundation-funded proof of concept, we set out to examine whether local attitudes expressed in online discussions could be correlated with local socioeconomic and demographic data.
We chose unemployment as our topic because it is a key element of the 2012 campaign and it generates prolonged discussions on both liberal and conservative sides. We expected it would give us a large enough set of local samples to work with.
After some experimentation, we decided our primary sampling for online attitudes would be blogs with user comments – the more the better. Local journalists often take the perspective of local values, and local bloggers often take on the task of spreading local news. We found that it was far more important to target sources with lively local discussions than to insist on differentiating between blogs and local news. The goal was capturing the conversation.
To enable comparison of attitudes extracted from online discussions with socioeconomic data, we followed strict sampling rules to make sure our samples were both as randomized and as representative as possible. In other words, relative to online discussions, sources had to have no hidden biases and enough comments to reflect the local populace.
As our primary target for local-level analysis, we selected three swing states with different levels of unemployment: Florida, Ohio and Virginia. We also analyzed two ideologically divided national-level samples, mostly for reference purposes, extracted from two high-profile blogs, one liberal and one conservative. And to get a better sense of more localized differences, we looked at two county-level samples from Florida, comparing results from a large, diverse metro area, Miami, with those from a small county, Citrus, with approximately the same level of unemployment. We restricted the time span to the current election campaign, with national-level blogs limited to 2012, and state- and county-level blogs extending back into 2011.
For the state-level comparison, after a long search through the wild woods of local blogs – many of which offer pure opinion from the left or right, or mostly reprint AP news – we selected the Patch.com network. Compared to the others, it provided a unique blend of advantages: balanced, straight-news coverage of over 20 counties per state, local posts only, lots of comments, and full transparency in author affiliations. We pulled unemployment-related content for all counties listed under the targeted states.
We used a web-scraping approach to extract structured content as cleanly as possible. After removing duplicates, job ads and the like, we ran the content through a standard text-processing pipeline to extract frequency distributions of words and phrases, separately for main texts and comments.
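As a minimal sketch of that frequency-extraction step (the function name, tokenizer and sample texts here are our own illustrations, not the project's actual pipeline), tokenization and per-1,000-word normalization could look like:

```python
import re
from collections import Counter

def relative_frequencies(texts, per=1000):
    """Occurrences of each word per `per` words across the given texts."""
    tokens = []
    for text in texts:
        # crude tokenizer: lowercase words, keeping apostrophes
        tokens.extend(re.findall(r"[a-z']+", text.lower()))
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n * per / total for word, n in counts.items()}

# main texts and comments are processed separately, as in the project
main_freqs = relative_frequencies(["Unemployment rose again this month."])
comment_freqs = relative_frequencies(["No jobs here.", "Jobs went overseas."])
```

Normalizing to a fixed word count is what makes samples of very different sizes (a national blog vs. a small county) comparable.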
We included every post containing an attitude related to unemployment, even when the remaining paragraphs were not about unemployment. Our basic interpretation was that as long as the rest of the text could be related to unemployment, it provided background context relevant to researching networks of associated concepts. The total sample size was 2,600 posts, with close to 100,000 comments and over 2.6 million words (including some 10% 2-grams). For more details on our methodology visit this link.
We performed three different types of analysis across the extracted samples:
• We compared how relative word frequencies (number of occurrences per 1,000 words across the sample, including comments) correspond with socioeconomic statistics – things like unemployment, homelessness and federal expenditures. We looked into groups of concepts (related to standard of living, unions, religion, etc.) rather than individual concepts, and considered results significant only if whole groups corresponded in the same way.
• In a separate analysis, we developed a metric for scoring the words most associated with extensive discussions.
• We also performed sentiment analysis on main texts and comments, both for positive/negative polarization and for the six basic emotions, and compared the sentiment ratios across samples.
Word Frequencies vs. Socioeconomic Statistics
We found statistically significant correlations of groups of related words not only with unemployment rates but also with other statistics connected to unemployment, including union membership, homelessness rates, commuting time, etc.
We also investigated some unexpected results, and found some interesting stories behind them. One notable example was the high incidence of foreclosure-related terms in Virginia, the state least affected by the foreclosure crisis, which turned out to be connected to the Emergency Homeowners' Loan Program (EHLP). This program offered interest-free loans to homeowners affected by involuntary unemployment, and among our target states only Virginia was included. Florida and Ohio, much more affected by the crisis, both received much larger loan packages, but at the market level, not specifically tied to unemployment, so these didn't show in our results. This finding was not only interesting on its own; it suggested a set of stories that news organizations – national and local – might want to explore further as the nation's housing crisis wears on.
While researching online discussions, we noted great differences in the numbers of comments and wondered what fuels prolonged, intensive discussions. We developed a method for weighting word frequencies by comment counts to capture this important trait: the involvement index.
The two word clouds below (showcasing the use of the involvement index to rank words in the Miami and Citrus County samples) offer a direct view into the network of words and concepts involved in intensive discussions connected to unemployment, and into how those words and concepts vary by community:
More details on Involvement Index methodology can be found here.
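A minimal sketch of an involvement-style weighting, under the assumption (from the description above) that each word's count is weighted by the number of comments in the thread where it appears; the exact formula is in the linked methodology note, and the data layout here is our own invention:

```python
from collections import Counter

def involvement_index(threads):
    """threads: (comment_count, comment_text) pairs, one per post.

    Words from heavily commented threads accumulate more weight, so the
    resulting scores rank words by association with lively discussion.
    """
    scores = Counter()
    for n_comments, text in threads:
        for word in text.lower().split():
            scores[word] += n_comments
    total = sum(scores.values())
    return {word: s / total for word, s in scores.items()}
```

Ranking words by a score like this, rather than by raw frequency, is what lets a word cloud surface the terms that drove debate rather than the terms that merely appeared often.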
As a consistent tendency across all seven researched samples (states, liberal/conservative blogs and counties), we found that comments were significantly more negative than main post texts. More negative main texts also tended to generate more comments. In the state-level samples, we found some interesting traits (for example, 'anger' corresponding to crime rates). But on the whole, we found that the standard set of emotions commonly used in sentiment analysis (anger, joy, sadness, disgust, fear and surprise) was too coarse to capture fine-grained differences between states. More results are available here.
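The polarity comparison behind the first of those findings can be sketched with a toy lexicon. The word lists and function names here are illustrative stand-ins for a real sentiment lexicon or classifier, not the tools the project actually used:

```python
# illustrative stand-ins for a real sentiment lexicon
NEG = {"lost", "crisis", "angry", "unemployed", "foreclosure"}
POS = {"hired", "recovery", "hope", "growth"}

def polarity(text):
    """Net positive-minus-negative hits, normalized by text length."""
    words = text.lower().split()
    score = sum(w in POS for w in words) - sum(w in NEG for w in words)
    return score / max(len(words), 1)

def mean_polarity(texts):
    return sum(polarity(t) for t in texts) / len(texts)

# comparing mean_polarity over main posts against mean_polarity over
# their comment threads exposes the post-vs-comment negativity gap
```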
The Internet and the Web are awash in conversation – from blogs to user comments on stories – but that world of online discussion does not have to exist in a walled-off wired realm. In this project, we found that online discussions, when carefully extracted and processed, can provide a straightforward way to see inside the lives of different communities. When we compared those conversations to real facts and figures, we learned how different levels of unemployment hit people in their everyday lives, how they hit their communities, and how those impacts changed how people saw the issue of unemployment – the framework they use to understand what it means.
Now that we’ve seen that these frameworks can be extracted from the words people use online, there is no reason why the same kind of approach wouldn't deliver equally interesting results when applied to other target areas, or even other types of investigation.
Perhaps our greatest discovery was that we saw much more in the data than we had imagined. What we learned actually suggested stories and issues we had not considered in our previous analyses of these states and communities. In this proof of concept research we concentrated on detecting correlations on one topic. But if carried out with big enough samples, this kind of approach opens a door to answering a much more intriguing and bigger question: what else is going on out there that we are missing?