The Over-Extended Link Engine
Imagine yourself entering the Tour de France. You have trained hard. Every day you rode your bicycle, worked out beyond your pain limit, and spent a fortune getting to the top of the list. As you approach the start line, everything changes. The judges suddenly announce that a 16-member team from Mars will be allowed to compete in the race, and as we all know, Martians have 3 legs and ride Thricycles. You jump at the firing of the start gun, but the Martians cover the entire course in 27 minutes. Within seconds you find yourself a hot contender for 8th place. With any intelligence you give up and protest.
This is exactly what has happened to our Search Engine Results Pages. With few exceptions, spammers controlling infected bots have overtaken all former rank challengers and will soon dominate web results. Perhaps even more frightening - they will soon start creating their own results, inadvertently.
As far as most users are concerned, what Google does is a bit of a mystery when it comes to page ranking, but for SEOs most of the equations are known. Time itself is a large factor when it comes to getting moved up the charts, as well as naming, keywords, keeping things "honest" and backlinking. Or so you might think.
A number of spam attacks have been occurring of late on the 'personal journal' websites, such as Blogger, Spaces Live and AOL Journals. This would normally be of little concern, except that the attackers are able to use compromised PCs to create accounts which bypass or solve the CAPTCHA method of authentication (that's those silly images with warped words in them which you have to type). Originally these attacks were somewhat low on the scale of importance. They did little to affect the general internet structure. Websense's capture of one of these attacks shows how the bots are instructed to get in and post regularly. The success rate is about 10%, which is more than acceptable for a hijacking.
However, things have changed for the worse. These spambots have now been re-engineered into link farming rankbots. They now hold thousands of blogging and personal accounts, and each one is chock full of backlinks to every other related rankbot site. These spamblogs are filled to the brim with pretty keywords for the hot products being offered. As it turns out, these keywords used to be simply whatever was on Google's trendlist for the day. If you aren't familiar with Google Trends, it's basically a list of the hottest search terms. These rankbots are able to quickly create several hundred pages, each on a different site and account, all referring to the keyword in question with very simple formulas (too simple in fact, exposing major flaws in Google's search armory). Now many such searches will show the top ranked blogs as being pure rankbots with tons of garbage text. Suddenly the voices of millions of other bloggers have been drowned out by the sound of spam.
As if that wasn't enough, I was recently treated to watching firsthand one of these bot sites get so highly ranked that it was number 3 on Google's first page of web results for a recent large event in Paris. Google has so over-tweaked the power of a backlink that a domain name like http://8u8ehfjsgsi.spaces.live.com can be number one for the hottest search terms - beating out real contributors and large companies, and leaving the world of personal spaces and blogs in the dust. On a recent scan of the top ten hottest terms, 6 were overtaken by rankbots in the blogosphere, and 1 had a top 3 ranking in Google's Web Results. Now that's some serious blowout, but it's just the tip of the iceberg!
One keyword in the top 20 hottest terms each day will be something you cannot decipher. This word could be something like the recent "Prolix Ungulate" attack. What am I talking about? These spamdexing botnets have overtaken the ranking so completely that the garbage text they generate becomes the hottest keyword of the day. Imagine that all the searches performed on Google in a day (200 million if I remember correctly) do not add up to the spam generated by hacked bots blurting out noise into the blogosphere. If permitted to persist, these rankbots will own every keyword in Google's engine. I know this sounds ridiculous, but it's not. Sources at Google indicate that work is being done to address the issue, but it has technically been an open issue since 2005, when splogging first began.
Here are a few of the exposed flaws:
Overranking of backlinks: Much like IMDB or Wikipedia, these rankbots have links upon links to each other in a very Google-friendly format without overextending themselves. If you notice that many of your everyday searches turn up a Wikipedia entry before they show you the actual product you searched for, that is due to the number of times users have linked into the page at the wiki, and vice versa. Rankbots expose this anomaly very cleanly. Google especially favors links whose visible text is a phrase such as "Phraseologist" rather than a bare URL such as http://www.phraseologist.com/, which rankbots also exploit greatly to their advantage.
Overranking of personal blogs: In order to raise preference for orgs like Wikipedia, Google has to give less weight to the domain name of a link and more to its content. As a result, a search for most common or news terms will typically show results from a wiki before the actual property owner. This is because a link on an indexed page causes a huge uptick in rank. Allowing personal spaces to dominate such links means that corporations linking to each other in a partnership, like HP-EDS, show up low in results compared to others covering the event. These spam blogs exploit this by understanding that once they are indexed, they have the same leverage as a dot-com domain with a name matching the hot term. So they can instantly jump to the top of the results index with no penalty for being a personal blog.
De-ranking of domain names: For a long time, a domain name was key to any rankable site. Just as HP.com should rank first for HP searches, freakpizza.com would overtake most contenders for the "freak pizza" search results. But this is no longer the case. Dot COMs still have some reasonable weight, especially without a dash, and .NETs, .ORGs and even .INFOs still hold their own in the mighty domain world. But their presence is greatly diminished. Search for an item of interest and you might find only 1 keyword-based domain name which matches your search. Most of the rank now comes from content and backlinks, as stated above, and the domain name is probably being overly repressed.
Timing - Rankbots are posting to their respective blogs every 12-24 hours. This must be an exploit of time-weighted posts. If someone posts every 60 minutes, they may be given less weight over time or raise a flag. By posting once per day or so, the bots seem to maximize this weight-of-time effect without diminishing returns or penalties that would bring the page to the forefront of suspicion. Also, each page in a link farm posts its link to the other pages at different times, making it appear as if the "hot news" is spreading slowly across multiple users.
Permitted Relinkage - The link to the spyware which the bot wants the user to click on is typically repeated several times over the course of the page. Google appears to be ignoring this and the bots are reveling in the ease of linkage. Even the outside link to the image is being repeated on the page with no ill effect.
Break Spacing - It is unusual for a web page or even a personal blogger to use 70 "break" statements in a row in HTML, but apparently Google is not seeing this as a page exploit either. The goal of the extended breaks is to make the end user think that he/she is seeing the only post on the page.
Administrative Overload - After noticing a spam blog at rank 3 on page one of the search results, I notified Google through the administrative webmaster tools panel. It took 3 days for someone to answer my request that the page be removed. Certainly Google must be overwhelmed with such requests as a general rule, and these specifically more than likely require even more investigative research before removal. By the time Google removed the indexed result, two other web results had infected slots 19 and 20.
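Two of the flaws above, Permitted Relinkage and Break Spacing, are simple enough to check for on a single page. What follows is only a rough sketch of how such a check might look, using Python's standard html.parser module. The class name, the function name and the two thresholds are my own inventions for illustration, and nothing here reflects how Google actually scores pages.

```python
from collections import Counter
from html.parser import HTMLParser


class PageSignals(HTMLParser):
    """Collects two crude spam signals from one HTML page (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.link_counts = Counter()   # href -> number of times it appears
        self.longest_break_run = 0     # longest run of consecutive <br> tags
        self._current_run = 0

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self._current_run += 1
            self.longest_break_run = max(self.longest_break_run, self._current_run)
            return
        # Any other opening tag ends the current run of breaks.
        self._current_run = 0
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.link_counts[href] += 1

    def handle_data(self, data):
        # Visible text between breaks ends the run; bare whitespace does not.
        if data.strip():
            self._current_run = 0


def looks_suspicious(html, max_breaks=20, max_repeats=3):
    """Flag a page with a long break run or a heavily repeated link.

    Both thresholds are guesses chosen for the example, not measured values.
    """
    parser = PageSignals()
    parser.feed(html)
    parser.close()
    repeated = [url for url, n in parser.link_counts.items() if n > max_repeats]
    return parser.longest_break_run > max_breaks or bool(repeated)
```

A page that repeats one link five times and then pads the post with 70 breaks trips both signals, while an ordinary post with a single line break and a single link trips neither. Thresholds like these would obviously need tuning against real pages before anyone relied on them.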
It seems the new generation of rankbot spam is overtaking our industry, replacing the real world with the unknown. If nothing were done about it, we would have pages upon pages of rankbot results to filter through with no end. This crisis resembles junk email in a very real way. At first you don't care if you get a single piece of junk, but all of a sudden one day you are overtaken by Viagra spam and you can't stop it - because all the providers ignored the "small" problem for too long. These seemingly innocuous exploits are becoming more commonplace and will soon break through the barrier, with a massive Martian noise.
Google is working hard to correct this problem, but I suspect that there are now new political struggles overtaking the giant in its quest to balance personal life with the real world. Users and their content have been overweighted in an effort to create a blogo-tubular-universe of googlisms. The beast feeds upon itself and its message, in turn creating a new genre which cannot be steered from the helm any longer.
This does make some sense. Google is very focused on user content, especially in ways that relate to YouTube, and may have been under great pressure to rank user content high on the webscale. However, this has backfired a bit too much. You will get YouTube results for just about anything with a link to any page in the universe, and it is discreditable. Search engine optimization has turned a blind eye to this influx of over-personalization. Without an anchor of some title or type, such results should seriously be considered and treated as junk. Yes, I know that there are users out there who are inventing new words and cool fun things to laugh at every day - but if 100 of them each link to each other with no real world link (real world being nytimes, cnn, wikipedia, etc), then it is just more brabble from the gallery. Such anchors are necessary, perhaps not for a blog rank but certainly for a webrank. And domain names do need to hold some weight. A SERP rank should not be so easily given to a random AOL Journal page.
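To make the anchor idea concrete, here is a minimal sketch in Python of how a group of pages that link only to each other could be picked out of a link graph. Everything in it is an assumption made for illustration: the function name, the shape of the input (a dict mapping each page to the pages it links to), the minimum cluster size, and my reading of "real world link" as an inbound link from a trusted site. It is not a description of how any search engine works.

```python
from collections import defaultdict


def unanchored_clusters(links, trusted, min_size=3):
    """Return groups of interlinked pages that no trusted site links to.

    links:   dict mapping a page to the set of pages it links to
    trusted: set of pages treated as real-world anchors (an assumption)
    """
    # Build an undirected neighbour map over untrusted pages only,
    # and remember which untrusted pages a trusted site links to.
    neighbours = defaultdict(set)
    anchored = set()
    for src, targets in links.items():
        for dst in targets:
            if src in trusted:
                if dst not in trusted:
                    anchored.add(dst)
            elif dst not in trusted:
                neighbours[src].add(dst)
                neighbours[dst].add(src)

    clusters, seen = [], set()
    for start in sorted(neighbours):
        if start in seen:
            continue
        # Depth-first walk to collect one connected group of pages.
        component, stack = set(), [start]
        while stack:
            page = stack.pop()
            if page in component:
                continue
            component.add(page)
            stack.extend(neighbours[page] - component)
        seen |= component
        # Keep the group only if it is big enough and has no anchor at all.
        if len(component) >= min_size and not (component & anchored):
            clusters.append(component)
    return clusters
```

Given three blogs that link only among themselves and three more where one of them is linked from nytimes.com, the function returns just the first group. A single anchor is a deliberately generous test here; a stricter version might require anchors in proportion to the size of the group.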
What has also been made very clear is that CAPTCHA is not an effective inoculation against the disease of bots. However, if we can manage the symptoms before they cause serious illness, we might pull through for the time being. In my next article I will try to address the disease itself: authentication.
Labels: rank spam, rankbot, search engine spam, spam blog
