
Search Engine Spider Traps, Oh My!
It shouldn’t take more than 48 hours for a bot, aka spider, to first notice your site. Yahoo claims to scour the web every 48 hours. Google says it could be up to two weeks before one of its crawlers gets to you. You can speed up the process (or so web lore claims) by submitting a site map to Google, Yahoo and Inktomi. In effect, the site map is an invitation to crawl and a notification that your site at least exists.
However, despite repeated invitations, you’re still invisible to all search engines – even the ones to which you submitted the URL according to the search engine’s wishes. The problem may not be that search engines avoid you. In fact, they can send crawlers daily. But if those bots get trapped on your site, they never gather data and report it for indexing.
Bots aren’t bright. They often get stuck in distant corners of a site. Sometimes, they can’t even get past the homepage! Trapped like a spider in a mason jar. No wonder your site hasn’t received the search engine recognition it so richly deserves.
Bot Traps and How To Avoid Creating Them
Bots follow links wherever they may lead. Links give these small programs some direction, so they move through a site in an orderly way rather than bouncing at random from one page to another, from one site to another. So, by using embedded text links in body text, you direct the crawling activities of bots.
These same links may also lead a bot into a trap from which (sniff!) there is no escape. So, when developing intra-site linkage, remember that bots may end up where you don’t want them. Let’s look at some common bot traps and how to avoid trapping crawlers on site.
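To make the idea of a trap concrete, here is a minimal sketch of how a link-following bot sees a site. It assumes a made-up, in-memory map of pages to the links they contain (the page names and the crawl function are hypothetical, not any search engine’s real crawler). It walks the links from the homepage and reports two things: pages with no way out, and pages the bot never reaches at all.

```python
from collections import deque

def crawl(link_graph, start):
    """Breadth-first walk over a hypothetical in-memory link graph.

    link_graph maps each page to the list of pages it links to.
    Returns (set of reachable pages, list of dead-end pages with no links out).
    """
    visited = {start}
    queue = deque([start])
    dead_ends = []
    while queue:
        page = queue.popleft()
        links = link_graph.get(page, [])
        if not links:
            dead_ends.append(page)  # the bot got in but has nowhere to go
        for target in links:
            if target not in visited:
                visited.add(target)
                queue.append(target)
    return visited, dead_ends

# Hypothetical site used only for illustration.
site = {
    "/": ["/products", "/login"],
    "/products": ["/", "/products/widget"],
    "/products/widget": ["/products"],
    "/login": [],                 # no links out: a dead end for a bot
    "/members/reports": ["/"],    # nothing public links here
}

reachable, dead_ends = crawl(site, "/")
unreachable = sorted(set(site) - reachable)
print("dead ends:", dead_ends)
print("never reached:", unreachable)
```

Run against a real site map, a report like this would point at the pages worth checking first.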
Robot Speak
In many cases, you can redirect spiders from specific site pages so they avoid the pitfall. Using HTML robots meta tags, a programmer can direct spiders away with a Do Not Disturb sign on the door.
A robots meta tag defines the paths a bot can and cannot take. Off-limits or restricted-access pages are sometimes called ‘arguments’ because, in fact, bots want to crawl everything. It’s their raison d’être.
<meta name="robots" content="noindex"> This command is recognized by all major search engines. It’s used to tell bots NOT to index a page and, so, all pages with a noindex command are left unindexed.
<meta name="robots" content="nofollow"> tells bots to ignore all links that appear on that page. This is important because it’s an effective means of directing spiders away from traps.
<meta name="robots" content="noodp"> informs the bots not to use dmoz.org (the Open Directory Project) to generate titles and descriptions for individual pages. In other words, don’t let a directory listing stand in for your own site content. This is critical since the Open Directory Project serves as Google’s default directory. Yahoo also employs elements of dmoz.org to supplement its own, proprietary directory. As such, this robots meta tag applies to Google and Yahoo.
<meta name="robots" content="noydir"> only applies to Yahoo, telling bots not to use the Yahoo Directory to generate titles and descriptions.
<meta name="robots" content="nosnippet"> only applies to Google which, unless told otherwise, will generate descriptions based on site text. Again, by defining acceptable practices for spiders, you increase control over how your site is indexed. Knowledge is power.
<meta name="robots" content="noarchive"> is recognized by all search engines. Pages identified as noarchive will not be cached and, therefore, will not appear in the cache view offered by Google, Yahoo, Live, Ask and other popular search engines.
Through the judicious and calculated use of robots meta data, you define what bots see and don’t see, and how what gets spidered can and can NOT be employed by a search engine index.
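If you want to check which robots directives a page actually declares, a short script can read them back out of the HTML. The sketch below uses only Python’s standard html.parser module. The class name, the function name and the sample page are made up for illustration, and the choice to look at meta tags named "robots" and "googlebot" is an assumption, not a complete list of what every engine honors.

```python
from html.parser import HTMLParser

# Meta tag names this sketch looks at (an assumption, not an exhaustive list).
ROBOT_META_NAMES = ("robots", "googlebot")

class RobotsMetaParser(HTMLParser):
    """Collects the directives found in robots-style meta tags."""

    def __init__(self):
        super().__init__()
        self.directives = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        name = attr.get("name", "").lower()
        if name in ROBOT_META_NAMES:
            # content is a comma-separated list such as "noindex, nofollow"
            values = {d.strip().lower()
                      for d in attr.get("content", "").split(",") if d.strip()}
            self.directives.setdefault(name, set()).update(values)

def robots_directives(html):
    """Return a dict mapping meta name to the set of directives it declares."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives

# Hypothetical page used only for illustration.
page = """<html><head>
<meta name="robots" content="noindex, nofollow">
<meta name="googlebot" content="noarchive">
</head><body>Members only</body></html>"""

found = robots_directives(page)
print(found)
```

A page that unexpectedly reports noindex or nofollow here is a likely reason it, or the pages behind it, never show up in an index.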
Parenthetically, some SEO pros believe that overusing robot meta tags raises suspicions on the part of bots, and indeed, these meta data are used by unscrupulous site owners to ward off spiders and subvert the relevance of SERPs.
Any SEO practice can be overdone and quickly detected by algorithm-driven alarm bells. However, using these directions prevents spiders from falling into unintended traps throughout your site.
Log-Ins
Pages that require a log-in (user name and password) can easily stymie a crawler. The bot may be able to get through the closed door but then fail to find another link back to the outside web. In fact, much of this “password-protected” content may never be indexed, and this is often the meat and potatoes of the site.
If the log-in appears on the home page, this may limit crawler access to the rest of the site, and putting a robot command on the home page is not good SEO practice.
HTML Frames
Frames are design elements that enable site developers to display more than one web page in a user’s browser simultaneously. There are vertical framesets and horizontal framesets, identified by the <frameset> tag. A frameset defines a set of rows (the rows attribute) in the case of horizontal framesets and a set of columns (the cols attribute) in vertical framesets. Values set by the programmer define the actual size of each frame that appears on the site’s presentation layer.
Frames are used by designers to create web pages that contain a great deal of information with links to deep site locations. Thus, the frame attracts visitor attention and encourages drilling down deeper into the site.
Bots can become trapped in frames, which are often “dead-ends” – not to humans but to bots. Visitors won’t necessarily interact with site frames. Crawlers will, and in that case, they enter but never leave – and this page remains unindexed.
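A quick way to see whether a page leans on frames is to scan its HTML for frameset markup. The following sketch, again built on Python’s standard html.parser, lists the pages loaded into frames and notes whether any <noframes> fallback is present. The FrameAudit class and the sample page are hypothetical, and the check is a rough heuristic rather than a statement of how any particular crawler treats frames.

```python
from html.parser import HTMLParser

class FrameAudit(HTMLParser):
    """Records frameset usage on a single page."""

    def __init__(self):
        super().__init__()
        self.has_frameset = False
        self.has_noframes = False
        self.frame_sources = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "frameset":
            self.has_frameset = True
        elif tag == "frame" and attr.get("src"):
            # Each frame pulls in a separate page a bot must reach on its own.
            self.frame_sources.append(attr["src"])
        elif tag == "noframes":
            self.has_noframes = True

# Hypothetical framed homepage used only for illustration.
homepage = """<html>
<frameset cols="20%,80%">
  <frame src="nav.html">
  <frame src="content.html">
</frameset>
</html>"""

audit = FrameAudit()
audit.feed(homepage)
print("uses frames:", audit.has_frameset)
print("framed pages:", audit.frame_sources)
print("has noframes fallback:", audit.has_noframes)
```

If a page uses frames and offers no fallback, the framed pages it lists are the ones to link to directly from ordinary, crawlable pages.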
Cookie-Restricted Pages
When you visit a website, you pick up a cookie: a small piece of data that records on-site activity, “remember my name on this computer” information and other “you-based” data.
When visitors to your site show up, you may deposit your own cookie in a jar, a cookie that allows access to some pages but not to the “for-pay-password-protected” content. Again, these are one-way dead ends for spiders, which can get in but may not find a link out.
URL Session IDs
Session IDs in URLs are totally confusing to spiders and, in fact, including them as part of the URL may actually hurt you in rankings. How?
Each time the site is spidered, a new URL is generated for that session. Each time the bot indexes the new URL, which contains the same content as the URL with an earlier session number, your site is slammed for duplicate content. In fact, the inclusion of session IDs in the URL creates site entropy: ongoing, self-perpetuating disintegration until the site reaches inertia and stops moving at all.
Session IDs in URLs not only trap spiders in a tangled web of what appears to be repetitious content. Each time the site is indexed, the same complaint draws the same conclusion: lower and lower PageRank. In this case, consider yourself lucky if the bot becomes site-bound. At least you aren’t losing ground.
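The duplicate-content effect is easy to demonstrate. The sketch below strips session parameters from URLs with Python’s standard urllib.parse, so that several session-stamped addresses collapse into one. The parameter names in SESSION_PARAMS and the example.com URLs are assumptions chosen for illustration; your own platform may use different names.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Query parameter names treated as session IDs (an assumption; adjust for your site).
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}

def strip_session_id(url):
    """Return the URL without session parameters (and without its fragment)."""
    parts = urlsplit(url)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key.lower() not in SESSION_PARAMS]
    # Some servers append ";jsessionid=..." to the path instead of the query.
    path = parts.path.split(";jsessionid=", 1)[0]
    return urlunsplit((parts.scheme, parts.netloc, path, urlencode(kept), ""))

# Three addresses a bot might collect for the same page on different visits.
urls = [
    "http://example.com/catalog?item=42&sid=abc123",
    "http://example.com/catalog?item=42&sid=zzz999",
    "http://example.com/catalog?item=42",
]

canonical = {strip_session_id(u) for u in urls}
print(canonical)
```

Three URLs, one page. Keeping the session out of the URL (in a cookie, for example) means the bot only ever sees the single clean address.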
Can Your Site Be Saved?
No problem.
Google offers complete diagnostics as part of its Webmaster Tools features. You can view your site the way Googlebots see it. Google will provide detailed stats on crawling activities over the lifetime of the website, and surprisingly, many sites have not been completely indexed because of errors detected by bots but undetected by site owners. If you haven’t run these analyses on your site, take some time today to do that.
Log on to your Google account, go to Webmaster Tools and click on Diagnostics. You’ll see an overview of crawling activity and a list of errors and problems encountered by Googlebot (including the date each problem was first detected). Google also provides the latest results of its Content Analysis and of its Mobile Crawl, which covers content intended specifically for mobile phones.
You May Not Even Know What’s Wrong Until You Ask
You can read all the SEO blogs, hang out at SEO bars and spend the day tweaking your site, all the time scratching your head over why your site hasn’t been indexed. Or completely indexed. You want to know why all of your site promotion has led nowhere.
It may be as simple as an undetected, unintended spider trap. Your site sees bots but they become trapped in frames or log-ins. The use and positioning of these commercial site staples may well keep your site invisible to search engines and to all of those potential visitors who use search engines.
That’s all of us.
Posted by webwordslinger 