Scrapers Target Online Forums
October 29, 2010
Your average Internet user leaves pieces of his or her life scattered all over the Web. Posts to social networks, comments on discussion boards, and product reviews on Amazon are just a few of the fingerprints we leave behind as we use sites, fingerprints that could be compiled into a significant dossier on our habits as consumers, voters, parents, and, as an article in The Wall Street Journal recently revealed, patients.
Marketers have always striven to collect information about potential customers to better target their pitches. Monitoring comments on blogs, Twitter, and even Facebook groups has become de rigueur, part of the whole social media marketing slog. Now, though, marketers and the software companies that serve them are taking that kind of information collection to another level.
Some of these approaches use tools that are as sophisticated as those used by government agencies with three-letter initials to collect and analyze "open source intelligence" on adversaries and potential threats.
One of these techniques is referred to by some as "scraping" — essentially capturing all of the content on a site and storing it in a database for analysis and review. Scraping technology itself has been around for a long time: Ask anyone who’s had the content of their site cloned wholesale by a spam blog’s robot. What’s new is the type of data being targeted and the level of analytics applied to it once it’s captured.
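To make the idea concrete, a bare-bones scraper is little more than a fetch-and-store loop. The sketch below is purely illustrative — the URL and database file are placeholders — and real scraping operations layer crawling, parsing and analytics on top of this core.

```python
# Minimal illustration of "scraping": download a page's HTML and archive it
# for later analysis. The target URL and database path are placeholders.
import sqlite3
import urllib.request
from datetime import datetime, timezone

DB_PATH = "scraped.db"                            # hypothetical local store
TARGET = "https://example.com/forum/thread/123"   # placeholder URL

def fetch(url):
    """Download the raw HTML of a page."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

def store(url, html):
    """Save the captured page in a local SQLite table for later review."""
    conn = sqlite3.connect(DB_PATH)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT, fetched_at TEXT, html TEXT)"
    )
    conn.execute(
        "INSERT INTO pages VALUES (?, ?, ?)",
        (url, datetime.now(timezone.utc).isoformat(), html),
    )
    conn.commit()
    conn.close()

if __name__ == "__main__":
    store(TARGET, fetch(TARGET))
```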
Take, for example, the case cited by The Wall Street Journal: Web software from Nielsen Co. connected to a members-only discussion area and copied much of its content. The discussion, hosted on PatientsLikeMe, a social network where people connect with others facing medical issues like their own, was for members coping with mood disorders such as depression and bipolar disorder, and the content was highly personal. Some of it included links to users’ personal blogs and social networking profiles.
Nielsen’s software is BuzzMetrics, a market intelligence application targeted at what Nielsen calls consumer-generated media (CGM). BuzzMetrics, according to Nielsen, "uncover[s] and integrate[s] data-driven insights culled from nearly 100 million blogs, social networks, groups, boards and other CGM platforms." The service focuses on "listening to unaided consumer conversations" — that is, conversations like those on PatientsLikeMe — to measure sentiment about brands and products.
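Sentiment measurement at that scale involves far more sophisticated language processing, but the basic idea can be shown with a toy lexicon-based scorer. The word lists and sample posts below are invented for illustration and have nothing to do with BuzzMetrics itself.

```python
# Toy sentiment scoring over captured posts: count positive minus negative
# words per post. The lexicons and example posts are made up for this sketch.
POSITIVE = {"love", "great", "helps", "relief", "better"}
NEGATIVE = {"hate", "awful", "worse", "side-effects", "useless"}

def sentiment_score(text):
    """Return positive-minus-negative word count for a single post."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    return len(words & POSITIVE) - len(words & NEGATIVE)

posts = [
    "This medication helps, I feel so much better.",
    "The side-effects are awful and it made things worse.",
]
for post in posts:
    print(sentiment_score(post), post)
```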
Nielsen may be the best-known name playing the scraping game, but it certainly isn’t alone. Buzzient, a startup born from a research collaboration with Google (Nasdaq: GOOG) at the Center for Digital Business at MIT, has its own "social media analytics" system, with patents pending for software that scrapes content from social networks and other sources. Buzzient integrates with customer relationship management systems like Oracle CRM, mining social media sources for lead generation.
Just how do BuzzMetrics and other services from the growing number of scraping startups get to data that, while not necessarily private, isn’t public either? In the case of PatientsLikeMe, it was likely through the creation of an account on the site, which was then used to bulk-copy all of the site’s discussions. There are other, subtler ways of monitoring sites, such as subscribing to the RSS feeds for discussions or simply doing change detection on the content of a discussion site or of a Facebook or MySpace group. And then there are the pipes that the social networks themselves provide: their own data feeds and APIs.
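The change-detection approach mentioned above can be as simple as polling a page and comparing a hash of its content between visits. The sketch below uses a placeholder URL and polling interval; a real monitor would go on to re-fetch and parse whatever changed.

```python
# Sketch of change detection: poll a page, hash its body, and flag when the
# hash differs from the previous visit. URL and interval are placeholders.
import hashlib
import time
import urllib.request

TARGET = "https://example.com/forum/recent"   # placeholder discussion page
POLL_SECONDS = 300

def page_fingerprint(url):
    """Return a SHA-256 digest of the page body."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return hashlib.sha256(resp.read()).hexdigest()

def watch(url):
    """Poll the page and report whenever its content changes."""
    last = None
    while True:
        current = page_fingerprint(url)
        if last is not None and current != last:
            print(url, "changed; re-capture the new posts here")
        last = current
        time.sleep(POLL_SECONDS)

if __name__ == "__main__":
    watch(TARGET)
```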
These tools aren’t just a challenge to individual privacy. They also pose a problem for site owners who want to protect their users’ privacy while still letting new users join easily and controlling how marketers engage their members. The scraping services may violate the terms of service of some sites, and their tactics often resemble a denial-of-service attack or a security exploit.
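From the site owner’s side, one rough defense is to watch for request patterns that look more like bulk copying than human browsing. The sketch below counts requests per client over a sliding window; the threshold and window size are illustrative assumptions, not recommendations.

```python
# Rough sketch of spotting scraper-like traffic: track request timestamps per
# client and flag anyone who exceeds an assumed per-minute threshold.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS = 120   # assumed threshold: faster than a person plausibly reads

_history = defaultdict(deque)   # client id -> recent request timestamps

def looks_like_scraper(client_id, now=None):
    """Record one request and report whether this client exceeds the limit."""
    now = time.time() if now is None else now
    hits = _history[client_id]
    hits.append(now)
    # Drop requests that have aged out of the sliding window.
    while hits and now - hits[0] > WINDOW_SECONDS:
        hits.popleft()
    return len(hits) > MAX_REQUESTS

if __name__ == "__main__":
    # A client issuing 200 requests back to back gets flagged.
    flagged = any(looks_like_scraper("203.0.113.7") for _ in range(200))
    print("flagged:", flagged)
```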
Still, the tools are just a further evolution of the same marketing data collection systems that have been used since before the Web opened to commercial use, and they’re not going away. Maybe the only way to keep them from probing into your online life is to do a different kind of scraping of your own — some thoughtful self-editing of what you post on the Web, and redaction of any cross-references among your online profiles.