Barriers to Successful Search Engine Crawling

Posted by Chris on February 21, 2009. Filed under SEO.

Unless search engine bots successfully crawl (or spider) a web page, it won’t be indexed or shown in search results. Here are some of the technical elements webmasters commonly use that can block web pages from search engine crawlers.

Robots.txt File

This file is placed in the root directory of a website. Here we can block unwanted pages and folders from search engine spiders. Even a popular site may be ignored by the major search engines if the contents of its robots.txt file are incorrect. The basic syntax for a robots.txt file is:

User-agent: *
Disallow:

In the above syntax, “*” denotes any robot, such as Googlebot or Yahoo! Slurp, and “Disallow:” without a ‘/’ means that all the URLs on the site can be crawled and indexed. Even experienced webmasters sometimes accidentally place a ‘/’ after the “Disallow:” declaration, which in turn blocks all bots from visiting the site. The incorrect declaration would be:

User-agent: *
Disallow: /
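To see the difference between the two declarations without touching a live site, here is a minimal sketch using Python's standard urllib.robotparser module. The helper name can_crawl and the example.com URL are assumptions made up for illustration; they are not part of any search engine's tooling.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The two declarations discussed above.
ALLOW_ALL = "User-agent: *\nDisallow:\n"
BLOCK_ALL = "User-agent: *\nDisallow: /\n"

# example.com is a placeholder domain.
print(can_crawl(ALLOW_ALL, "https://example.com/page.html"))  # True
print(can_crawl(BLOCK_ALL, "https://example.com/page.html"))  # False
```

Running a check like this against a copy of our robots.txt before uploading it is a cheap way to catch a stray ‘/’.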

Broken Links

Broken links are dead ends: they stop search engine bots from crawling further into the site, and the linked-to page may never be found by search engines. This can hurt the site’s rankings.
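A rough way to spot broken links ourselves is to request each URL and look at the HTTP status. The sketch below uses only Python's standard library. The function names http_status and find_broken_links are hypothetical, and treating every status of 400 or above as "broken" is a simplifying assumption.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, List


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status for url, or 0 if the server cannot be reached."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # e.g. 404 or 500
    except (urllib.error.URLError, OSError):
        return 0  # DNS failure, refused connection, timeout


def find_broken_links(
    urls: Iterable[str],
    get_status: Callable[[str], int] = http_status,
) -> List[str]:
    """Return the URLs that are unreachable (0) or answer with an error (>= 400)."""
    broken = []
    for url in urls:
        status = get_status(url)
        if status == 0 or status >= 400:
            broken.append(url)
    return broken


# Offline demonstration with made-up statuses instead of real requests.
fake_statuses = {
    "https://example.com/": 200,
    "https://example.com/old-page": 404,
}
print(find_broken_links(fake_statuses, fake_statuses.__getitem__))
# ['https://example.com/old-page']
```

Passing the status lookup in as a parameter keeps the check testable without a network connection; calling find_broken_links with only a list of URLs makes real HEAD requests.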

Server Downtime

Occasional downtime is not a problem. But if the site is repeatedly down and inaccessible to crawlers, it may drop in search engine rankings.

Image links

There is a slight disadvantage to using image links. We should not forget to add an ‘alt’ attribute to each image, so that search engines can index it under the text given there. They need a name or keyword under which to register the content.
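Image links with no ‘alt’ text are easy to miss by eye. As a small sketch, the standard html.parser module can list them for a page's HTML. The class and function names here are invented for the example, and the file names are placeholders.

```python
from html.parser import HTMLParser
from typing import List


class MissingAltFinder(HTMLParser):
    """Collect the src of every <img> inside a link that has no usable alt text."""

    def __init__(self):
        super().__init__()
        self.link_depth = 0      # > 0 while we are inside an <a> element
        self.missing_alt = []    # src values of linked images lacking alt text

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1
        elif tag == "img" and self.link_depth > 0:
            attributes = dict(attrs)
            # A missing, valueless, or whitespace-only alt all count as missing.
            if not (attributes.get("alt") or "").strip():
                self.missing_alt.append(attributes.get("src") or "")

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth > 0:
            self.link_depth -= 1


def image_links_missing_alt(html: str) -> List[str]:
    """Return the src of each linked image in html that lacks alt text."""
    finder = MissingAltFinder()
    finder.feed(html)
    finder.close()
    return finder.missing_alt


sample = (
    '<a href="/home"><img src="home.png" alt="Home"></a>'
    '<a href="/contact"><img src="contact.png"></a>'
)
print(image_links_missing_alt(sample))  # ['contact.png']
```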

Orphan pages

Search engine spiders cannot reach the orphan pages of a site. A page is an orphan if no other page on the website links to it. Since spiders crawl web pages by following the links that point to them, they cannot find orphaned pages.
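If we already have a list of the site's pages and the internal links on each, finding orphans is a simple set operation. The sketch below assumes that link map is available as a dictionary (how it is collected is outside its scope). The name find_orphan_pages and the sample paths are hypothetical, and the home page is excluded on the assumption that crawlers start there.

```python
from typing import Dict, Iterable, Set


def find_orphan_pages(links: Dict[str, Iterable[str]], home: str = "/") -> Set[str]:
    """Return the pages that no other page on the site links to.

    links maps each page to the pages it links to. The home page is
    skipped because it is assumed to be the crawler's starting point.
    """
    linked_to = set()
    for source, targets in links.items():
        for target in targets:
            if target != source:  # a page linking to itself does not count
                linked_to.add(target)
    return {page for page in links if page not in linked_to and page != home}


site = {
    "/": ["/about", "/services"],
    "/about": ["/"],
    "/services": [],
    "/old-offer": ["/"],  # links out, but nothing links in
}
print(find_orphan_pages(site))  # {'/old-offer'}
```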

Links inside images in Flash page

Even though Google has improved its ability to index Flash pages, it will not index the content within images in Flash. It is better to avoid using Flash for the entire site until the full impact is known.

A Simple Check

To ensure all the links on a page are visible to search engine spiders, we can run a simple check by viewing the text-only version of Google’s cached snapshot of the page. This shows all the links that search engines can crawl. The text-only version shows the page with images off, so we can verify whether the links behind images and menus are visible to the search engines. If the site is new or not yet indexed, we can use the text-only browser Lynx to verify instead.
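For a quick local approximation of that text-only view, the sketch below lists each link on a page together with the text a text-only reader would see for it, falling back to image ‘alt’ text. It is only a rough stand-in built on Python's html.parser, not a reproduction of how Google's cache or Lynx render a page. The names used are made up for the example, and nested links are not handled.

```python
from html.parser import HTMLParser
from typing import List, Tuple


class LinkTextExtractor(HTMLParser):
    """Collect (href, text) pairs, using image alt text where a link has no words."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the link currently being read, if any
        self._parts = []    # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            self._href = attributes["href"]
            self._parts = []
        elif tag == "img" and self._href is not None:
            self._parts.append(attributes.get("alt") or "")

    def handle_data(self, data):
        if self._href is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace into single spaces.
            text = " ".join(" ".join(self._parts).split())
            self.links.append((self._href, text))
            self._href = None


def text_only_links(html: str) -> List[Tuple[str, str]]:
    """Return (href, visible text) for every link found in html."""
    extractor = LinkTextExtractor()
    extractor.feed(html)
    extractor.close()
    return extractor.links


page = (
    '<a href="/home"><img src="home.png" alt="Home"></a> '
    '<a href="/blog">Our  blog</a> '
    '<a href="/promo"><img src="promo.png"></a>'
)
for href, text in text_only_links(page):
    print(href, "->", repr(text))
```

A link that comes back with empty text, like /promo in the sample, is one a text-only reader sees with no label at all.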

Crawl errors

We can find a site’s crawl errors using Google Webmaster Tools, which provides a wealth of information on the visibility of our site. URLs that are not found, unreachable, timing out, or restricted by robots.txt are all listed there. This helps webmasters easily track down the cause of the errors and fix them.

Disclaimer: The post is completely based on individual thoughts and SEO Services Group bears no responsibilities for the thoughts reflected in the post.
