Understanding attitudes is notoriously difficult in statistical analysis. They can be expressed in many different ways. Sentences can have very different meanings when interpreted by different mindsets in different contexts. As poll research has repeatedly shown, the smallest changes in wording used in questionnaires can greatly affect results. This is even more true of online attitudes, where interpretation of individual sentences often relies on almost uncontrollable wider contexts.
As a consequence of this complexity, questions of the most basic kind, like 'What do people think about X?' have a limited scope of relevancy. They may be applicable to some special cases, like brand related like/dislike polarization, but any deeper attitude analysis requires questions of a different kind:
'When people write about X to what extent do they also write about A, B, C...?'
What can be captured in this way are not quite attitudes, as they are usually expressed, but, more directly, underlying networks of associated concepts, the associational framework. There is a large body of psychological evidence that these frameworks greatly influence our everyday choices, and that fact is extensively exploited in marketing (for example, through associating some product with something else known to be generally liked). So, by understanding these associations we are actually tapping into the background field, the frame, from which attitudes draw their dependencies.
There are some additional reasons these associational frameworks are important as far as research is concerned. Unlike attitudes, usually encapsulated into uniquely structured sentences, associational frameworks can be represented as a list of concepts and their relations, so they can be more readily aggregated. In terms of potential influence, and unlike attitudes, these concept networks can be transferred with little or no resistance. We tend to defend our attitudes, but hardly ever object to newly associated terms, which then in their own right can alter our attitudes by extending the related association field in new directions. Consequently, by monitoring shifts in these association frameworks we might be looking into changes in attitudes before they eventually become evident.
Researching these frameworks translates into investigating correlations, not causations. In other words, based on measuring what x% of people put in their blogs, we cannot conclude what x% of people will think or might do; but we can put some restraints on what we can reasonably expect them to think or do.
Researching Online Discussions
In this pilot research, carried out under a Knight Foundation grant, we set out to examine networks of associated concepts extracted from online discussions related to unemployment. More specifically, we looked into differences between geographic units, relative to socioeconomic and demographic statistics, and between ideological standpoints. We targeted three levels of granularity:
- national level, through comparing two high profile blog spots, one liberal and one conservative
- state level, where we compared results from three swing states, with different levels of unemployment
- county level, comparing results from a large city and a small county with approximately the same level of unemployment
When it comes to expressing attitudes, blogs are a natural place to look. They are far more elaborate in their structure and their arguments than social media such as Facebook or Twitter, and often initiate prolonged discussions on the subject. Rather than being something of a competitor, these newer forms of online social networking actually boost the visibility of blogs, through massive use of their sharing tools, contributing to the general trend of increase in blogs' relative influence.
Blogs are usually closely related to news. Not only do many blog posts draw their inspiration from daily news, they extensively quote news and often start discussions by simply providing an angle of interpretation for quoted news.
Bloggers also share some other important traits with journalists. They are generally far more educated and more informed than the general population. Relative to their potential influence, their views eventually find their way, to some extent, into the popular views. They also often reflect the messaging and ideas of the political left and right. In many ways they are the newspaper columnists of the 21st century.
The difference between blogs and news is especially blurred on the local level, as local journalists often tend to take the perspective of local values, and local bloggers often take on the task of spreading the local news. In this work we found that, on the national level, it was useful to concentrate on blogs only, to get more original views on general matters. On the local level, however, it was far more important to target sources with lively local discussions than to insist on differentiating between blogs and local news. What made the difference was the local relevance of posts, as expressed in the number and content of comments, rather than the formal professional affiliation of their authors.
We chose unemployment as our topic because it has been in the focus of political debate for a very long time, generating prolonged discussions on both liberal and conservative sides, so we could expect to get large enough samples to work with. Unemployment also has the advantage of not containing any internal bias: it is generally univocally considered a bad thing, so we could use it as a relatively fixed pivot point while examining its pairings with related issues and accompanied sentiments.
After compiling a list of the most popular and influential blog spots for each category, we manually inspected a number of blog websites and excluded those that were generating too few results in the initial search for 'unemployment' and those that had almost zero comments. We generally restricted the time span to the 2012 election campaign, with national level blogs limited to 2012, and state and county level blogs extending back into 2011.
- For liberal versus conservative comparison on national level we selected Dailykos.com and Blaze.com blog websites, based on their relative independence from party establishment controlled policies, and abundance of comments.
- For state level comparison, after a long search through the wild woods of local blog spots, many of which are controlled by only a few people, often with hidden agendas and often just reprinting AP news, we selected the Patch.com network. Compared to others, it provided a unique blend of advantages, including general transparency (authors have to specify their affiliation, which results in generally more neutral tone), the fact that the local sites include only local posts with lots of comments, and balanced coverage of a few dozen counties for each state. We targeted three states with different unemployment rates, heavily focused in the current election campaign: Florida, Ohio and Virginia. We extracted unemployment related content for all counties listed for those states.
- For county level comparison, we looked in Florida, the state with the highest unemployment rate, and selected two large enough blog sites covering counties with approximately the same unemployment rate, Miami (Miami Herald blogs) and Citrus County (Citrus County Chronicle blogs).
From these selected blog websites we extracted lists of posts containing 'unemployment' either in the title or in the text, or (repeatedly) in comments, using local website search. After creating the lists and deleting duplicates (mostly from the Patch.com network, which often repeats the same posts across neighboring counties), we used web scraping to extract the blogs' content in the cleanest possible way.
We tested a number of open source software packages for web scraping and finally selected Web Harvester. It supports the full range of XPath expressions for targeting only the content that is useful: title, subtitle, category, date, author, main text, comments text, comments number, etc. This method initially includes a significant amount of 'manual detective work' (as every element for every blog spot has to be targeted by its class or id in the HTML code), but it makes everything much simpler later on. Also, and unlike some other packages, it will work on every blog spot.
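The idea of targeting each field by its class or id can be sketched as follows. This is a minimal illustration using Python's standard library, not Web Harvester itself. The markup, the class and id names (`post-title`, `post-date`, `post-body`, `comment`) and the helper names are all hypothetical, and the sample is deliberately well-formed, since real blog HTML usually needs a more tolerant parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed post markup (an assumption for illustration);
# real blog pages rarely parse as strict XML.
PAGE = """
<html><body>
  <h1 class="post-title">Jobless rate falls in county</h1>
  <span class="post-date">2012-09-14</span>
  <div id="post-body"><p>Unemployment fell last month.</p><p>Local firms are hiring.</p></div>
  <div class="comment"><p>Not in my town.</p></div>
  <div class="comment"><p>Part-time jobs do not count.</p></div>
</body></html>
"""

# One XPath expression per field, keyed by the element's class or id.
# These names are assumptions; each blog spot needs its own mapping.
FIELD_PATHS = {
    "title": ".//h1[@class='post-title']",
    "date": ".//span[@class='post-date']",
    "main_text": ".//div[@id='post-body']",
    "comments": ".//div[@class='comment']",
}


def node_text(node):
    """Join all text inside a node and collapse runs of whitespace."""
    return " ".join(" ".join(node.itertext()).split())


def extract_post(markup, paths=FIELD_PATHS):
    """Return a dict with one entry per targeted field."""
    root = ET.fromstring(markup.strip())
    record = {}
    for field, path in paths.items():
        texts = [node_text(n) for n in root.findall(path)]
        # Comments stay a list; every other field is expected to match once.
        record[field] = texts if field == "comments" else (texts[0] if texts else "")
    record["comment_count"] = len(record["comments"])
    return record


post = extract_post(PAGE)
```

The per-site mapping is where the 'manual detective work' goes: once the paths are written for one blog spot, every post from that site can be extracted the same way.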
We passed the extracted content through the standard text processing procedure, including tokenization, filtering stop words, case transformation, stemming and creating n-grams, using RapidMiner, an open source data mining package with a very potent set of text mining plug-ins. We eliminated posts not really discussing unemployment as a policy matter or an issue (such as posts for job hunters) using standard supervised machine learning tools included in RapidMiner.
We kept as relevant all posts which contained an attitude related to unemployment, even if all remaining paragraphs were not about unemployment. Our basic interpretation was that as long as the rest of the text could be related to unemployment, it provided a background context, relevant to the research of networks of associated concepts. The only exceptions were two very long discussions, including hundreds of comments, with the vast majority of the content completely unrelated to unemployment, which we manually removed after carefully considering their potential to significantly distort the overall result by their sheer size.
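For readers who have not used RapidMiner, the processing chain can be approximated in a few lines of Python. This is only a sketch under stated assumptions: the stop list is a tiny illustrative one, `crude_stem` is a naive suffix stripper standing in for a real stemmer such as Porter, and the function names are hypothetical rather than anything RapidMiner exposes.

```python
import re

# Tiny illustrative stop list (an assumption); a real run would use a
# full stop-word lexicon.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to", "for"}


def crude_stem(token):
    """Strip a few common suffixes. A stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text, n=2):
    """Lowercase, tokenize, drop stop words, stem, then append n-grams."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    stems = [crude_stem(t) for t in tokens]
    ngrams = ["_".join(stems[i:i + n]) for i in range(len(stems) - n + 1)]
    return stems + ngrams


terms = preprocess("The unions are fighting for jobs in Ohio")
# terms == ['union', 'fight', 'job', 'ohio',
#           'union_fight', 'fight_job', 'job_ohio']
```

Because stop words are removed before the n-grams are built, a 2-gram here can join two words that were separated by a stop word in the original sentence.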
The total sample size was 2,600 posts, with close to 100,000 comments and over 2.6 million stemmed words (including some 10% 2-grams). At the national level this included 1,300 posts, over 84,000 comments and 1.8 million words. At the state level (for 3 states) this included close to 1,000 posts, 15,000 comments and 700,000 words, and at county level (for 2 counties), it included 300 posts with 700 comments and 90,000 words. Samples for each level were evenly balanced, some containing slightly more posts, some more comments, but generally having approximately the same number of words.
While researching online discussions, we noted great differences in numbers of comment entries and wondered what fuels prolonged intensive discussions. In other words, is there some general trait that explains why some blogs generate hundreds of comments and some have none? Converted into associational frameworks, this question takes the form 'Which (networked) concepts co-occurred with prolonged and deeply involved discussions, having a profound influence on exchanging association networks?'
To answer that question, we used comment numbers to weight each word appearance, then ordered the words by aggregated weighted frequencies, then extracted only the top 10% (to eliminate random fluctuations), and then ordered them by average values for each instance (to eliminate the most common words repeating in all discussions). We called the result the engagement index rank, and found it a useful tool to capture what the researched communities take special interest in with their online discussions.
We combined main texts and comments, as in most cases it was impossible to discern what fueled discussions more: prominent titles, implications of the main text, or prominent comments generating huge numbers of replies.
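The four steps above can be sketched as follows. This is one possible reading of the procedure, not a record of the exact computation: it assumes that the 'average value for each instance' is a word's total comment weight divided by its number of occurrences, and the function name, the input shape and the toy posts are hypothetical.

```python
from collections import Counter
import math


def engagement_rank(posts, keep_fraction=0.10):
    """Rank words by average comment weight per occurrence.

    posts: iterable of (tokens, comment_count) pairs.
    Only the top `keep_fraction` of words by aggregated weighted
    frequency are kept before the final ordering.
    """
    weighted = Counter()     # comment counts summed over every occurrence
    occurrences = Counter()  # plain occurrence counts
    for tokens, n_comments in posts:
        for tok in tokens:
            weighted[tok] += n_comments
            occurrences[tok] += 1
    keep = max(1, math.ceil(len(weighted) * keep_fraction))
    top = [word for word, _ in weighted.most_common(keep)]
    return sorted(top, key=lambda w: weighted[w] / occurrences[w], reverse=True)


# Toy data (made up): one heavily discussed post and two quiet ones.
toy_posts = [
    (["job", "union", "strike", "job"], 200),
    (["job", "tax", "rent"], 10),
    (["job", "fair", "resume"], 0),
]
# A larger fraction than 10% is used only because the toy vocabulary is tiny.
ranked = engagement_rank(toy_posts, keep_fraction=0.4)
```

In the toy data 'job' has the largest aggregated weight but appears in every post, so the per-occurrence average pushes it below 'union' and 'strike', which is the effect the final ordering step is meant to have.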
In future research it would be interesting to compare these engagement measures to total references from other blogs and total page views, in the wider perspective of general impact of these online discussions.
We also found that comment numbers are closely related to the extent to which blogs ride the general wave of public interest. Those waves of interest can be measured through surges in related Twitter traffic and through tracking the use of online search engines. They should be accounted for relative to the targeted research goal and incorporated in any longitudinal study of this kind.
We performed three different types of analysis across the extracted samples:
- We compared how relative word frequencies (number of occurrences per 1000 words across the sample, including comments) correspond with socioeconomic statistics. We looked into groups of concepts (related to standard of living, unions, religion, etc.) rather than individual concepts and considered results significant if whole groups were corresponding in the same way.
- In a separate analysis, after generating engagement index ranks, and after removing a few words which could be misinterpreted outside the context (surnames of influential local businessmen, like 'fox' or 'baker'), we extracted the top 100 words as the ones most co-occurring with extensive discussions, and converted them into word clouds, representing the engagement associational frameworks.
- For the sentiment analysis, we processed separately main texts and comments through the R-Sentiment package (including positive/negative and 6 basic emotions), and compared the sentiment analysis ratios across samples. We used the engagement index for weighting results as an additional tool for comparison.
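The relative word frequency used in the first analysis above (occurrences per 1000 words) is simple to compute. The sketch below assumes a sample is already a flat list of stemmed tokens; the function name and the toy sample are hypothetical, and the terms 'rent' and 'gas' are used only as examples of a concept group.

```python
from collections import Counter


def per_thousand(tokens, terms):
    """Occurrences per 1000 words for each requested term in one sample."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: 1000.0 * counts[term] / total for term in terms}


# Made-up sample of 2000 tokens, for illustration only.
sample = ["rent"] * 6 + ["gas"] * 2 + ["filler"] * 1992
rates = per_thousand(sample, ["rent", "gas", "cloth"])
# rates == {'rent': 3.0, 'gas': 1.0, 'cloth': 0.0}
```

Normalizing by sample length is what makes samples of different sizes comparable, which matters here because the national, state and county samples differ widely in total word count.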
We performed cross validation tests (using 70% sample segments) for the most important results and found the results were less prominent, but still present, so we could presume that with larger samples they would become more prominent. Findings which did not pass the validation test were simply disregarded. One notable example was all references to politicians on the state level, which we found slightly imbalanced, as a consequence of our targeting unemployment and not politics per se, so we excluded them from our state level results. On the national level the sample was more balanced with respect to politics, so we could use results directly related to politics there, though only as a marginal reference.
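A check of this kind can be sketched as repeated re-testing on random 70% subsets of the posts. The text does not say how the segments were drawn, so random sampling without replacement, the number of rounds, and all names and data below are assumptions made for illustration.

```python
import random


def holds_on_subsamples(posts, finding, fraction=0.7, rounds=20, seed=0):
    """Share of random subsamples on which `finding(subsample)` is true.

    Each subsample contains `fraction` of the posts, drawn without
    replacement. `finding` is any function returning True or False.
    """
    rng = random.Random(seed)
    k = max(1, round(len(posts) * fraction))
    hits = sum(bool(finding(rng.sample(posts, k))) for _ in range(rounds))
    return hits / rounds


def rent_higher_in_a(posts):
    """Example finding: 'rent' is relatively more frequent in sample A than B."""
    def rate(state):
        tokens = [t for p in posts if p["state"] == state for t in p["tokens"]]
        return tokens.count("rent") / len(tokens) if tokens else None
    a, b = rate("A"), rate("B")
    return a is not None and b is not None and a > b


# Made-up posts: 'rent' is three times as frequent in A as in B.
toy = ([{"state": "A", "tokens": ["rent"] * 3 + ["x"] * 7}] * 10
       + [{"state": "B", "tokens": ["rent"] + ["x"] * 9}] * 10)
stability = holds_on_subsamples(toy, rent_higher_in_a)
```

A finding that survives on nearly all subsamples is unlikely to be driven by a handful of posts. One that flips often would be disregarded, as described above.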
Word Frequencies vs. Socioeconomic Statistics
This kind of analysis could be conclusively carried out on the state level only. We did some analysis on the national level and county level, but for illustration purposes only.
We started out by verifying the quality of the sample by comparing words expected to behave relatively evenly across all state samples. We looked into highly neutral words (like 'thousand', 'yesterday', 'etc'), neutral verbs (like 'fix', 'choose'), general evaluation pairs ('good', 'bad', 'strong', 'weak') and found very balanced outcomes.
Then we looked into correlations with socioeconomic data. We found statistically significant correlations throughout the full spectrum of term frequencies (presented here as number of occurrences per 1000 words). For words with high frequencies, and accordingly relatively small error margins, we looked for close similarities with socioeconomic statistics. For small frequencies we took into account only words evenly spread across the sample. Even though we could expect relatively high error margins there (sometimes exceeding 30%), we could still use the results if differences between samples were much higher, e.g. exceeding 100%, and still followed the same general distribution order. As we were examining groups of words, rather than individual occurrences, this correspondence in the ordering of all related terms added significant weight.
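The rule for low-frequency words can be expressed as a simple check: every term in a group must rank the states in the same order as the statistic, and its spread between states must clearly exceed the expected error margin. The sketch below is an illustration of that reasoning only. The state labels, all numbers, the function names and the reading of 'exceeding 100%' as a relative difference between the highest and lowest frequency are assumptions.

```python
def state_order(values):
    """States sorted from the highest to the lowest value."""
    return sorted(values, key=values.get, reverse=True)


def group_follows_statistic(group_freqs, statistic, min_spread=1.0):
    """True if every term orders the states like `statistic` does and its
    highest frequency exceeds its lowest by more than `min_spread`
    (1.0 means a relative difference above 100%)."""
    expected = state_order(statistic)
    for freqs in group_freqs.values():
        if state_order(freqs) != expected:
            return False
        low, high = min(freqs.values()), max(freqs.values())
        if low <= 0 or (high - low) / low <= min_spread:
            return False
    return True


# Made-up numbers for three placeholder states; not results from this study.
statistic = {"A": 9.0, "B": 7.0, "C": 5.0}
group = {
    "rent": {"A": 0.9, "B": 0.6, "C": 0.3},
    "gas": {"A": 1.2, "B": 0.8, "C": 0.5},
}
consistent = group_follows_statistic(group, statistic)

# Swapping two states for one term breaks the common ordering.
broken = dict(group, gas={"A": 1.2, "B": 0.5, "C": 0.8})
inconsistent = group_follows_statistic(broken, statistic)
```

Requiring the whole group to agree is the stricter part of the test: a single noisy term can match a statistic by chance, but several related terms all following the same order is much less likely to be accidental.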
Here are some of the expected correlations with unemployment rates:
Among less expected results, we found that unemployment rates correlated with frequencies of basic provision terms (like 'cloth', 'gas', 'rent') and other terms related to basic survival, in spite of state-level poverty rates pointing in a different direction (Ohio leading with 12.3, Florida 11.1 and Virginia 9.2).
We found similar correspondences with other term groups (including compassion, religion etc.). We also looked into some interesting opposite correlations:
We then moved on to examine other traits related to unemployment and found correlations with union membership percentages, homeless rate, etc. Here are some examples for union membership data:
and commuting time data:
We also investigated some unexpected results, the most notable one being the high incidence of foreclosure rate related terms in Virginia:
We could expect these frequencies to be in the range of 0.2 to 0.5 per thousand words. The Ohio and Florida results followed this, in an order correlated with foreclosure rates, but there was a huge spike in Virginia. After examining the actual word occurrences in the matrix, we found a number of posts related to The Emergency Homeowners' Loan Program (EHLP). This program offered interest free loans for homeowners affected by involuntary unemployment, and among our target states only Virginia was included in EHLP. Florida and Ohio, much more affected by the crisis, both received larger loan packages, but on the market level, and not specifically related to unemployment, so these didn't show in our results. This investigation also turned out to be an interesting story lead from the journalistic perspective.
The full list of results will be presented separately in several blogs on NPR.
Engagement Association Frameworks
The two word clouds below (engagement association frameworks for Miami and Citrus County) offer a direct view into the network of concepts associated with intensive discussions connected to unemployment:
These two word clouds also showcase why it is difficult to compare results on the county level. Although these two counties have almost identical unemployment rates, there are almost no connecting points. While in Miami the discussion is centered around big, state level issues, like outsourcing, passing bills and so on, in Citrus county participants concentrate on highly local matters.
Here are the results we obtained after running the Dailykos.com (liberal) and Blaze.com (conservative) samples through the standard R-Sentiment package (values representing relative normalized indexes for each sample):
We found, as a consistent tendency across all 7 researched samples, that comments are significantly more negative than main post texts. Also, when they generate more comments, main texts tend to be more negative (Main Text Weighted column below).
(0-1=negative; 1-2=neutral; >2=positive)
As can be seen from the above example, when not polarized across ideological lines, main texts tend to remain in the neutral territory (which could be partly attributed to the transparency policy Patch.com insists on, but we found the same tendency in Citrus and Miami). But not the comments. Comments also generally tend to be angrier and sadder.
In the state level samples, we found some interesting traits (for example 'anger' corresponding to crime rates), but we generally found the division into 6 emotions too coarse to capture fine-grained differences between states.
We found that online discussions, when carefully extracted and processed, provide a straightforward way to understand how different levels of unemployment hit people in their everyday lives, how it hits their communities, how their interpretation frameworks correspond with their socioeconomic environment and ideological preferences, and how these frameworks can be extracted from the words they use. There is no reason why the same kind of approach wouldn't deliver equally interesting results when applied to other targeted areas, or even other types of investigation.
Perhaps our greatest find was that we discovered more in the data than we could even think of. It actually suggested stories and issues we had not considered in our previous analyses of these states and communities. Though in this proof of concept research we concentrated on detecting correlations with existing statistics, if carried out with big enough samples, this kind of approach clearly opens a door to answering the more intriguing and bigger questions of the kind: what else is out there?