Feb 26th 2010
Content filtering for Australia: Can it work?
Internet entrepreneur Martin Rushe has looked into the practicalities of filtering content over the internet. He found that not even the People’s Republic of China, with a team of 30,000 people and unlimited resources, could make it work.
On a recent trip to China I stopped in at several stadium-sized internet cafes to run some tests. I wanted to see for myself The Great Firewall of China in action, the infamous content filtering system implemented by the People’s Republic under the more benign title of The Golden Shield.
I had been told by friends that upwards of 30,000 people worked on the system, a reasonably sized IT department by anyone’s estimation. I had also watched with interest when Cisco, HP and Google were hauled up in front of the US Congress to explain, amongst other things, their selling of weapons-grade internet technology to China. So clearly no shortage of expertise. And one can only imagine cash was not a problem for the project.
I furtively typed my list of keywords (Tiananmen Square, Falun Gong, Taiwan etc) into a clutch of search engines and waited to jot down the response. The results, when they came, were predictable.
More than 100,000 uninterrupted and well-informed pages of content on each topic were available. Graphic pictures, anti-government sentiment, swingeing satire and all.
Hardly a controlled experiment. But at the very least it is strong anecdotal evidence supporting a simple catch-cry which many in the Australian IT community have trumpeted since Senator Conroy’s first white paper: Content Filtering is very hard.
Let’s put aside for a moment the obvious moral questions of censorship and explore this simply as an engineering problem. Is Senator Conroy’s stated aim possible? And if so, at what cost to the taxpayer and to the Internet user?
To do that let us first take a look at some of the basics of Content Filtering.
We must begin by asking what it is we wish to filter? The internet carries a host of different traffic types. Ask a network engineer what passes through his router on any given day and (if it’s not a personal question) he’ll tell you the list of content types is long and growing. My understanding is the present intention is to filter HTTP traffic or web pages. Makes sense, since web use is the dominant and most readily accessible (think browsing by phone) form of Internet content.
And that’s where we hit roadblock number one. The content which Senator Conroy ostensibly most wishes to filter, namely obscene pornography, is already illegal in most countries and therefore not accessed over the public web. It is shared over private peer-to-peer networks and via direct communication, such as email attachments, neither of which can be inspected by real-time content filtering systems. Of course, the Government could open emails and peer inside, but I suspect even the strident Senator Conroy would avoid that public privacy debate.
So if we limit ourselves to filtering web traffic, how do we intend to do it? Naturally there are a thousand ways to skin the cyber-cat but in the main they fall into three camps: White List, Black List and Value Based. In each an administrator is required to implement a policy of what should and should not be filtered. But that strays into the moral question.
White Lists are simple. You make a list of all the web addresses (URLs) people are allowed to access and limit them to that. Foolproof in terms of restricting access to URLs but it does require you to monitor the content of those web sites. For example, you allow Facebook but have you checked everyone’s profile to make sure no nasties lurk within? And how often are you checking, because profiles change?
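As a rough illustration, a White List check might look like the Python sketch below. The host names and the `is_allowed` helper are assumptions made up for this example, not taken from any real filtering product.

```python
from urllib.parse import urlsplit

# Hypothetical white list; the host names are illustrative only.
ALLOWED_HOSTS = {"www.facebook.com", "www.abc.net.au"}


def is_allowed(url: str) -> bool:
    """Return True only if the URL's host appears on the white list."""
    host = urlsplit(url).hostname or ""
    return host in ALLOWED_HOSTS


print(is_allowed("http://www.facebook.com/profile.php?id=1"))  # True
print(is_allowed("http://example.org/"))                       # False
```

Note that the check only ever sees the address. Every profile under an allowed host passes, whatever it happens to contain on the day.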
Black Lists are equally simple and equally flawed. With a Black List you specify all the URLs people are not allowed to access. Draconian but effective. Until the people whose content you are blocking see you are blocking it and move to a different URL. It’s the cyber equivalent of Police moving undesirables down the street.
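The weakness is easy to see in a minimal Python sketch. The blocked host name and the `is_blocked` helper are placeholders invented for this example.

```python
from urllib.parse import urlsplit

# Hypothetical block list; the host name is a placeholder.
BLOCKED_HOSTS = {"banned.example.com"}


def is_blocked(url: str) -> bool:
    """Return True if the URL's host appears on the block list."""
    host = urlsplit(url).hostname or ""
    return host in BLOCKED_HOSTS


print(is_blocked("http://banned.example.com/page"))   # True
# Same content, republished at a new address: the list has not caught up.
print(is_blocked("http://banned2.example.com/page"))  # False
```

The list is only as good as its last update, so the administrator is always one move behind the publisher.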
The Value Based systems are more complex. They actually look at the content and, using a set of policies, take a view on the content’s nature and thus appropriateness. An example is keyword scanning, where the content filtering system reads the words on a web page and apportions values to undesirable words (to ensure this article passes through Value Based content filters I haven’t listed the obvious candidates for undesirable keywords but you can use your imagination). Add up the values of the offensive keywords and if your web page breaches the threshold it can’t come in.
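Keyword scanning of this kind can be sketched in a few lines of Python. The words, weights and threshold below are all assumptions chosen for illustration (harmless placeholder words stand in for the real candidates), and real products are certainly more elaborate.

```python
import re

# Hypothetical weights: placeholder words standing in for undesirable ones.
KEYWORD_WEIGHTS = {"badword": 5, "dubious": 2}
THRESHOLD = 10  # hypothetical cut-off


def page_score(text: str) -> int:
    """Sum the weights of every listed keyword found in the page text."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(KEYWORD_WEIGHTS.get(word, 0) for word in words)


def is_page_blocked(text: str) -> bool:
    """Block the page once its total score exceeds the threshold."""
    return page_score(text) > THRESHOLD


print(page_score("Badword, dubious badword!"))       # 12
print(is_page_blocked("Badword, dubious badword!"))  # True
print(is_page_blocked("A dubious claim."))           # False
```

A filter like this has no idea whether the words appear in the offending material itself or in a news report about it, which is why the choice of weights and threshold ends up being a policy decision rather than a technical one.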
But what about pictures? No words in pictures. Well, keyword filtering can’t deal with them. Systems exist which scan for flesh tones and take a view on how much flesh is too much flesh. It sounds marginal and it is. You risk taking the Louvre’s excellent online presence offline along with every dermatologist’s web site. And flesh is such a difficult commodity to qualify. Too much knee may not offend, but too much, well, other parts, might. And the provenance of the knee is critical: to whom does it belong and how old is it?
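To show why flesh-tone scanning is so blunt, here is a minimal Python sketch. It assumes an image has already been decoded into a list of (red, green, blue) pixel values, and it uses a crude rule of thumb for "skin-coloured" pixels. The rule, the threshold and the function names are all assumptions for this example, not how any particular product works.

```python
FLESH_THRESHOLD = 0.4  # hypothetical: block if over 40% of pixels look like skin


def looks_like_skin(r: int, g: int, b: int) -> bool:
    """Crude RGB rule of thumb: reddish, reasonably bright, not grey."""
    return (
        r > 95 and g > 40 and b > 20
        and max(r, g, b) - min(r, g, b) > 15
        and abs(r - g) > 15
        and r > g and r > b
    )


def skin_fraction(pixels) -> float:
    """Fraction of pixels that the rule above classifies as skin."""
    pixels = list(pixels)
    if not pixels:
        return 0.0
    return sum(1 for p in pixels if looks_like_skin(*p)) / len(pixels)


def is_image_blocked(pixels) -> bool:
    return skin_fraction(pixels) > FLESH_THRESHOLD


# Three skin-toned pixels and one blue one.
sample = [(220, 170, 140)] * 3 + [(0, 0, 255)]
print(skin_fraction(sample))     # 0.75
print(is_image_blocked(sample))  # True
```

The score is the same whether those pixels come from a Renaissance nude, a medical photograph or something genuinely objectionable. The arithmetic counts colour and knows nothing about context.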
Most probably any system implemented at a national level will contain a combination of these methods and others besides. But combine the obvious challenges these methods face and suddenly China’s 30,000-strong Internet Police start to look understaffed.

