Monday, October 27, 2008

Can the Web predict the next president?

Analysis of Web traffic and search patterns shows Obama's site more popular than McCain's
By Thomas A. Powell, Network World Lab Alliance, and Joe Lima, Network World

IT professionals have historically monitored network traffic patterns to better understand network usage, to expose security events, and to generally promote overall network health. Traffic analysis can likewise be applied to the Web to understand a wide range of behavior patterns ranging from social media networks to suggestion systems in e-commerce to even the current hot topic: the presidential race.

What we found in our analysis, which applied both rudimentary measures (such as campaign site visits and domain registration tallies) and more complex traffic-tracking mechanisms (such as search tallies and online trading trends) across other Internet segments, is that online traffic patterns are leaning, not unlike traditional polling data, in the direction of Sen. Barack Obama. (See a slideshow explaining eight ways technology has shaped the elections.)

To assess whether traffic-analysis techniques could provide any insight into the election, we first examined the most basic measurements available, campaign site visits and domain registrations, and then moved to more content-focused metrics such as blog mentions and social network links.

All the data discussed in this article is provided by live URLs so readers can view the content themselves to tie into the most up-to-date information. Our intent is not to judge either the McCain or Obama campaigns or the candidates themselves, but rather to examine how the election is shaping up online and delve into whether the data provides any insight into the eventual outcome.

Traffic trends

The usage of the two campaign Web sites can be tracked like any other large Web site via services like Hitwise, Alexa, Compete and Google Trends, to name a few. Overall, traffic to the campaign Web sites shows very clear trends regardless of data source.

Hitwise shows a consistent 2-to-1 advantage in unique site visitors for Obama's official campaign site in a head-to-head comparison, from August through early October -- with the exception of a significant narrowing of that gap around the week of the Republican National Convention.

Comparing the two campaign sites against other popular sites on the Internet tells a similar story. ranked both campaign sites in the top 500 Web sites in the United States for September — with the Obama site significantly more popular at No. 186 than the McCain site at No. 384, even though the latter has made up tremendous ground over the past year.

Similarly, numbers from Compete show the Obama campaign site serving about 64% of the total unique visitors to both campaign sites for the year ending in September. The graph shows that the Obama site has managed to sustain a large early lead built up during the protracted Democratic nomination contest, despite rapid growth by the McCain site as the general election campaign itself intensified over the summer.

Quantcast had rankings similar to Compete's, with the Obama site sitting at No. 115 with around 7.9 million visitors, compared with the McCain site at No. 272 with around 4.3 million visitors (this matches Compete's traffic split as well). These Quantcast traffic trends also showed a "pinch" at the time of the Republican convention in an otherwise wide and relatively persistent gap. One additional point of interest in the Quantcast data is a noticeable widening of Obama's normal (pre-convention) advantage, starting sometime around the middle of September. Quantcast, at least, appears to show Obama pulling away.
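As a rough check on that comparison, the visitor counts can be turned into a share of combined traffic. The short Python sketch below is only an illustration: the helper name traffic_share is our own, and the inputs are the rounded Quantcast figures quoted above, so the result is approximate.

```python
def traffic_share(visitors_a: float, visitors_b: float) -> float:
    """Return site A's share of the combined visitors to both sites."""
    total = visitors_a + visitors_b
    if total <= 0:
        raise ValueError("combined visitor count must be positive")
    return visitors_a / total


# Rounded Quantcast figures quoted in the article, in millions of visitors.
obama_visitors = 7.9
mccain_visitors = 4.3

quantcast_share = traffic_share(obama_visitors, mccain_visitors)
print(f"Obama share of combined visitors: {quantcast_share:.1%}")
```

On these rounded inputs the share comes out at roughly 65%, which is in line with the 64% figure reported from Compete.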

The popular Alexa traffic tracking service tells a very similar tale. The Alexa service shows the McCain campaign site with a traffic rank overall on the Internet of 3,074 while the Obama site is ranked 869 overall. Comparing the two sites' traffic patterns using Alexa shows the same general trend, with the now familiar convention-related pinch and an otherwise sizable gap in Obama's favor. The Alexa data, like the Quantcast data, also shows a discernible widening of that gap starting around mid-September.

Sampling bias can be a concern in any measurement. While Alexa and Hitwise do have very large samples that would help guard against bias, Google provides an even larger pool of users. Using Google Site Trends, we see once again the gap between the usage of the candidate sites.

A deeper drill down into this Google-based trending reveals some interesting points. First, we see that when looking at the data on a day-to-day basis, there is a clear pinch in the trend lines on Sept. 2 that is normalized out in the month-to-month view. This validates what we saw earlier in larger graphs when the Republican convention was finally in full swing.

Google data also shows that much of this traffic comes from the large-population states. The lists below show the top 10 states, by traffic, for each candidate's site. It is interesting to see that some of the so-called "battleground" states (Florida, Virginia, Pennsylvania, Colorado) are represented at different positions in the two lists. What is most notable, however, is that you can see a consistent lead for Obama (represented by the blue horizontal bars) regardless of which candidate's top-10 ranking you use. That is, his site leads not only in all 10 of the states in which he does best, but also in the McCain site's 10 highest-traffic states.

Raw traffic to a site may of course come from those against a candidate, as well as from supporters. However, features like "users also visited" on sites like Alexa, Google and Quantcast suggest that this is likely not a major factor in the traffic rankings. Instead, they indicate that visitors to a given candidate's site also tend to spend time on other sites with political orientations similar to the candidate's own. Thus, visitors to Obama's site tend also to visit popular liberal/progressive sites, while visitors to McCain's site tend to favor well-known conservative sites. If traffic from opponents were sizable, there would likely be a more mixed list of also-visited sites.

Searching for answers

Search term frequency is obviously a useful barometer of online interest. How does it fare as a gauge of actual candidate support? The fact that a candidate's name is being searched on can indicate many different attitudes towards that candidate -- from avid support to mere curiosity to outright opposition.

The expectation that search interest is an imperfect gauge of support turns out to be at least partly validated by the numbers. If we use Google's Insights for Search feature to look at the popularity of each presidential and vice presidential candidate's name as a search term, the results are largely similar to other results seen in this analysis, but with one very notable exception: Sarah Palin queries.

For the terms "Obama" and "McCain" we see the same relatively sustained advantage for Obama that shows up in most of the traffic numbers. That said, from the point of McCain's announcement of Sarah Palin as his running mate, the search term "Palin" spikes far above the others in popularity and holds that lead throughout the better part of September, returning consistently below "Obama" and "McCain" only in October.

This is perhaps not so surprising if we consider the context: Palin is the only one of the four candidates who was largely new to the national political scene until her introduction to it precisely during the time period being studied here. It stands to reason that many online politics watchers (including presumably a good many prospective voters) would use the Internet to inform themselves about the newest name on the national stage.

Because of this context, we speculate that the gradual ebbing of searches on Palin's name to levels more in line with those of the other candidates reflects not necessarily particular pro or con judgments, but simply her growing familiarity to most Web users interested in the race. That said, it is notable that she remains, so far as Google searches are concerned at least, a figure of considerably more online interest than Obama's running mate, Joe Biden.

Links are recommendations?

Search engines often focus on inbound links to understand the relevance and popularity of a site, page or piece of content. So we applied the same type of thinking to the candidate sites. As an example, Eric Miraglia's Page Inlink Analyzer (available at his blog) uses Yahoo and Delicious APIs to provide a powerful tool for examining a site's inbound links. If we apply this metric to the two campaign sites, we see that the different traffic patterns we have noted so far are reflected in, and no doubt sustained by, similar differences in link popularity. Once again, Obama's site (with over 1.7 million inbound links) maintains about a 2-to-1 overall advantage over McCain's (with about 0.8 million).

A similarly simple approach that measures links in Google's own site index also shows more links for Obama than for McCain, but we must note that the overall volume here is much lower.

To take into account the possibility of link tainting, we observed what Google indexed for each site. The site index footprint for McCain's site has a whopping 30,000 URLs recorded, compared with fewer than 3,000 for Obama's. So the inbound link vote is certainly not a function of the size of each site's index, but rather appears to be a legitimate view of outside interest, with Obama again having more votes by inbound link. However, is this a result of effective promotion and grassroots organization, or does it show real interest?

Online prediction

Finally, we looked at online trading. Intrade, the online prediction outfit used for non-sports-related events, has offered contracts for the major candidates throughout the campaign season. These contracts trade at $0.10 per point, in a range of 1-100 points. Because prices are limited to that range, a price of 50 points for a candidate's contract indicates that the market is estimating a 50% probability of that candidate winning the race.
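The arithmetic behind that reading of contract prices is simple enough to sketch in a few lines of Python. The function names below are our own and are not part of any Intrade interface; the only inputs taken from the description above are the $0.10-per-point value and the 1-100 point range.

```python
DOLLARS_PER_POINT = 0.10  # contract value per point, as described in the article


def implied_probability(price_points: float) -> float:
    """Read a contract price in points as the market's estimated win probability."""
    if not 1 <= price_points <= 100:
        raise ValueError("price must be within the 1-100 point range")
    return price_points / 100.0


def contract_cost(price_points: float) -> float:
    """Dollar cost of one contract at the given price in points."""
    return price_points * DOLLARS_PER_POINT


# Hypothetical example prices: an even race, a strong favorite, a long shot.
for points in (50, 80, 20):
    probability = implied_probability(points)
    cost = contract_cost(points)
    print(f"{points} points -> {probability:.0%} implied probability, ${cost:.2f} per contract")
```

Read this way, a contract trading at 80 points costs about $8.00 and implies that traders collectively put that candidate's chances at about 80%.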

Interestingly, if we examine the trends in daily closing prices for the candidates' respective contracts, we find them mapping reasonably well onto the broad measures of popularity that we found when looking at traffic to their respective campaign Web sites. Obama's contract was comfortably above 60 points for much of the summer. It then fell below 50 points (50% probability) for the first time in the week following the Republican convention. During that same week, McCain's contract rose above 50 points for the first time in the campaign. Then, just as quickly as they had converged, the trend lines separated again, with the Obama contract rising, and the McCain contract falling, from mid-September on.

These developments obviously correspond quite well to what we noted in most of the site traffic data — reflecting both Obama's summer-long advantage over McCain, as well as the temporary, convention-related convergence in their popularity in early September (the "pinch").

Moreover, the online prediction market seems to reflect the same more-recent trend noted with Quantcast's and Alexa's traffic numbers: a sharp increase, starting from about the middle of September, in Obama's typical advantage over McCain, with the Obama contract rising over 70 points for the first time in early October, just as McCain's contract was falling below 30 points for the first time since mid-July. Following the third Presidential debate (Oct. 15) Obama's Intrade contract was trading at a new high of over 80 points, and McCain's at a new low of below 20.

Rigging the system?

Even casual Web users believe vote stuffing is common in simple online polls, as most simple Web polls can be defeated by users clearing cookies, purging Flash settings, coming from other IP addresses, crowdsourcing votes, and even using bots. Likewise, some of the more focused systems exhibit sampling bias. For example, we did not include Amazon data for book ratings or levels because of what appeared to be clear manipulation. We saw similar effects in promotion of articles across social networks. Most interesting is that Intrade, while mostly matching the data observed elsewhere, can also be gamed: a single investor drove up McCain contract values at one point in the campaign.

Our point is that using the Web requires careful consideration of sample size and ease of manipulation. We do not believe that rigging is a concern for the measures selected as the primary focus of this article. The numbers used for correlation from the larger-scale systems such as Alexa, Google and Hitwise are just too large to be easily gamed.

Online vs. offline measurements

It's clear that the Web-based measurements of candidate popularity have a consistent story to tell, albeit with some variations depending on the type and source of the data. This raises the obvious question of how well what we can measure online matches up to what is going on in the world at large. Do Web-based measures, as a whole, contain a sampling bias? Does the Internet make its own waves, so to speak? Does it have its own ebb and flow of opinion, or does it more-or-less reflect what is happening in the broader society?

Answering those questions in a detailed way is beyond the scope of our article. We've given some indication of which online measures we consider perhaps more reliable because they are less likely to be tainted by demographic sampling bias. Much more definite conclusions than that we'll leave to analysts with more time to crunch the data.

We can however at least sketch the relationship between online and offline opinion by comparing some of the more reliable Web-based measures of candidate popularity with a more traditional metric of popular opinion. For the latter, a common choice is the Gallup Daily Tracking Poll.

The Gallup Daily tracker is a three-day moving average. We took this snapshot on the morning of Oct. 20, so that it would cover interviews from Oct. 16-18 -- the first three days after the third and final presidential debate.
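For readers unfamiliar with the idea, a three-day moving average can be sketched as below. This is a minimal illustration only: the daily figures are invented placeholders rather than Gallup data, the function name rolling_mean is our own, and Gallup's actual procedure (which pools interviews across days) may differ from a plain average of daily percentages.

```python
from collections import deque


def rolling_mean(values, window=3):
    """Yield the mean of each full trailing window of `window` values."""
    if window < 1:
        raise ValueError("window must be at least 1")
    recent = deque(maxlen=window)  # automatically drops the oldest value
    for value in values:
        recent.append(value)
        if len(recent) == window:
            yield sum(recent) / window


# Placeholder daily support figures (percent), invented for illustration only.
daily_support = [48.0, 51.0, 51.0, 50.0, 52.0]

smoothed = list(rolling_mean(daily_support))
print(smoothed)
```

Five daily readings yield three averaged points, one for each complete three-day span, which is why a snapshot taken on Oct. 20 can be said to cover the interviews of the three preceding full days.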

Right away, we are struck by the fact that the poll portrays a much closer race overall than do most of the online measures of candidate popularity. For instance, Gallup shows McCain in a virtual tie with Obama in the second half of August — something none of the online measures depicts. Apparently, there is still some general sampling bias in using the universe of Internet users as representatives of the population at large. If Gallup is right, the online measures as a whole appear to overstate the degree of support for Obama relative to McCain.

If we look a little more closely, however, we do start to see some interesting correlations between the online and offline metrics. For example, after regaining his more-typical lead at the very end of August, Obama lost it again on Sept. 7, with McCain's advantage widening to 5 percentage points over the next couple of days. This, of course, is our familiar "pinch" in the trend lines, which we encountered in almost all of the site traffic metrics, and which immediately follows the Republican convention in the first week of September. In the Gallup poll, to be sure, it is a crossing of the lines rather than a pinch, since the gap between the candidates was never as wide as it was in the Web-based metrics.

Finally, the same surge in support for Obama after the middle of September, which we noted in the Quantcast and Alexa numbers as well as in the Intrade markets, shows up clearly in the Gallup poll. With the exception of a brief tie on Sept. 25, Gallup's story since mid-September has been one of Obama pulling away, with his lead reaching a new campaign high of 11 percentage points on Oct. 8.

In short, despite an apparent sampling bias that significantly exaggerates Obama's advantages when using the online numbers to infer candidate support in the general population, it certainly looks as if both online and offline metrics are reflecting real trends in underlying popular opinion, with common causal bases. The strongest conclusion we are prepared to offer is that the Web can be used by technically savvy individuals to obtain direct, detailed insight into real campaign trends. The first Tuesday in November will tell us how relevant such insight is to the actual outcome.
