
Web Crawlers and Googlebot

A Web Crawler is a computer program that automatically browses the World Wide Web in a methodical way. A Web Crawler is also called an ant, a bot, a worm or a Web spider. The process of scanning the WWW is called Web crawling or spidering.

What do Web Crawlers do?

Web Crawling is used by Search engines to provide up-to-date data to users. What Web Crawlers essentially do is create a copy of all the visited pages for later processing by a Search Engine. The search engine then indexes the downloaded pages in order to provide fast searches.

Web Crawlers are also used for automating tasks on websites, such as checking links or validating HTML code. A Web crawler usually starts with a list of URLs to visit (called the seeds). As the crawler visits these URLs, it identifies all the hyperlinks in each page and adds them to the list of URLs to visit (the crawl frontier). URLs from the frontier are then recursively visited according to a set of policies.

Why is Googlebot crawling your website?

Googlebot is an example of a crawler: it is the Web Crawler used by Google to collect documents from the web and build a searchable index for the Google search engine. Googlebot uses algorithmic software to determine which sites to crawl, how often, and how many pages to fetch from each site. To achieve this, Google uses a huge set of servers to “crawl” billions of website pages all over the web.

As a Web Crawler, Googlebot begins with a list of webpage URLs (generated from previous crawl processes) and also uses information from the Sitemap files provided by webmasters. Googlebot then detects links on each visited page and adds them to its list of pages to crawl. During this process, new sites, changes to existing sites, and dead links are noted and used to update the Google index.

Googlebot maintains a massive index of all the words it sees and their location on each page. It is also able to process information in content tags and attributes (e.g. title tags and ALT attributes). Googlebot cannot, however, process all content types (for example, it cannot process the content of some rich media files or dynamic pages). When a user enters a query, the Search Engine searches the index for matching pages and returns the most relevant results, determined by over 200 factors, such as the PageRank of a given page. PageRank is a measure of the importance of a page and is based on the incoming links from other pages.
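
As a simple illustration (the page title, file name and ALT text below are made up), this is the kind of markup Googlebot reads from a page's title tag and an image's ALT attribute:

<head>
<title>Handmade Leather Wallets | Example Shop</title>
</head>
<body>
<img src="wallet.jpg" alt="Brown handmade leather wallet">
</body>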

How to block Googlebot

If you want to block Googlebot (or other crawlers) from accessing and indexing the information on your website, you can add appropriate directives to your robots.txt file or simply add a robots meta tag to your webpage. Note that requests from Googlebot to Web servers include the user-agent string “Googlebot” and a host address ending in “googlebot.com”.
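
For example, a robots.txt file that blocks Googlebot from the entire site looks like this:

User-agent: Googlebot
Disallow: /

Alternatively, a single page can ask Googlebot not to index it or follow its links with a robots meta tag in the page's <head>:

<meta name="googlebot" content="noindex, nofollow">

Note that the robots.txt Disallow rule prevents crawling, whereas the meta tag is only seen (and obeyed) when the page is actually crawled.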

Fetch as Googlebot

If you want to see what Googlebot sees when it accesses your page, Google provides a “Fetch as Googlebot” feature that lets users submit pages and get real-time feedback on what Googlebot sees. “Fetch as Googlebot” comes in useful if users re-implement their site, find out that some of their web pages have been hacked, or want to understand why they’re not ranking for specific keywords.

If you want to use ‘Fetch as Googlebot‘ all you have to do is:

1. Login to Webmaster Tools
2. Select your site
3. Go to Labs –> Fetch as Googlebot

Search engines' key processes

In conclusion, the three key processes that Search engines need in order to deliver search results to users are:
1-Crawling: Does Google know about the existence of your website? Can Google find it?
2-Indexing: Can Google index your website?
3-Serving: Does your website have good quality and useful content that is relevant to potential users' searches?

Currently, Googlebot follows HREF links and SRC links. There is increasing evidence that Googlebot can execute JavaScript and parse content generated by Ajax calls as well. Googlebot discovers pages by harvesting all of the links on every page it finds and then following these links to other web pages. New web pages must be linked from other known pages on the web, or manually submitted by the webmaster, in order to be crawled and indexed.

Many generic web servers also support server-side scripting using Active Server Pages (ASP), PHP, or other scripting languages. This means that the behaviour of the web server can be scripted in separate files, while the actual server software remains unchanged. Usually, this function is used to create HTML documents dynamically ("on-the-fly") as opposed to returning static documents. The former is primarily used for retrieving and/or modifying information from databases. The latter is typically much faster and more easily cached.

Google Indexing Problems

This article contains useful tips and advice to solve common Google indexing and crawling problems. Firstly, it's common for a low-ranking website to be included in the Google index without actually being visible in the search engine results pages (SERPs).

You can check whether your website is indexed and cached by Google using a simple query command. To do this, type site:www.mydomain.com into a Google search window, replacing "mydomain" with your registered domain name. If Google returns the message "sorry, no information is available for the URL www.mydomain.com", then none of your website pages are in the Google index and you may have a Google indexing or crawling problem.

Check Google Cache

The Google site:www.mydomain.com query returns a list of all web pages in your domain which are indexed and cached in the Google index. If no web pages are indexed, this is often due to the web domain being new or recently launched, with not enough quality backlinks to make it into the Google index. To fix a Google indexing problem (including partial indexing), first check for website navigation problems which prevent Google from crawling your website. If no website navigation problem is found, we recommend getting more quality links to your website from other websites. You should also create an XML sitemap and submit it to Google.

An XML Sitemap lists all page URLs on your website and can provide additional information such as "Priority" (which defines the relative importance of each page in your website hierarchy) and "Change Frequency" (which specifies how often your content is updated). This helps encourage Googlebot to deep crawl your website and re-cache any recently updated page URLs.
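
A minimal sitemap.xml along these lines (the URLs are placeholders for your own pages) would be:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://www.mydomain.com/</loc>
<changefreq>daily</changefreq>
<priority>1.0</priority>
</url>
<url>
<loc>http://www.mydomain.com/products.html</loc>
<changefreq>weekly</changefreq>
<priority>0.8</priority>
</url>
</urlset>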

Test Your Robots.txt File

Sometimes search engine indexing problems are caused by Robots.txt file errors. Robots.txt is a small (non-mandatory) text file uploaded to the root directory of a web server to tell search engine robots which web pages and website assets (folders, images) should be excluded from search engine indexing.

A simple syntax error in the Robots.txt file could completely prevent Googlebot (Google's search spider or 'crawler') from indexing your website. For help, read our guide to creating and formatting a Robots.txt file. Google Webmaster Tools allows a Robots.txt file to be tested under "Site Configuration" > "Crawler Access". This can help to find Googlebot crawling problems caused by Robots.txt exclusions.
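
As an illustration (the folder name is hypothetical), the difference between excluding a single folder and accidentally excluding the whole site is a single character:

User-agent: *
Disallow: /private/

User-agent: *
Disallow: /

The first file only blocks the /private/ folder from being crawled; the second blocks every page on the site.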

Adding a URL to the Google Index

If you're wondering how to get a particular page or website which is so far not indexed into the Google index, you may wish to try submitting your URL to Google. Google Account holders may submit URLs for consideration by visiting the Google Add URL page. To get professional help to solve difficult Google indexing problems, contact KSL Consulting (a reasonable consultancy fee applies).

Significance of Grey Toolbar PageRank

If the Google Toolbar displays Grey PageRank on a particular page, it may in fact still be Google indexed (so check the page cache using the site: command described above). When Grey PR shows on the Google Toolbar, a "Google does not rank current page" message will usually appear on mouse-over of the PageRank indicator.

The most common causes of greyed out Google Toolbar PageRank include:

Insufficient PageRank Acquired - The page(s) simply might not be getting enough "link juice". This is often the case for pages which are well down the website navigation hierarchy with no external links pointing to them. Web pages which are many clicks from the homepage may also display Grey Toolbar PageRank because they fail to acquire sufficient PageRank from other, more important pages of the site.
Duplicate Content - Pages which are very similar to others, or of low quality or little value, may show Grey Google PR. Pages lacking any form of keyword focus may also suffer grey PageRank.
Search Engine Exclusions - Pages which are excluded from the Google index by Disallow entries in the Robots.txt file may be shown with a Greyed Out PageRank indication. These pages will typically not be Google indexed.
Banned Domains - Sites which have been banned from the Google index or suffered a Google Page Rank penalty may show a grey PageRank indicator. See our Google penalty page for help.

Factors Affecting Google Crawl Rate

The Googlebot crawling and indexing rate is influenced by a number of factors including:

Google PageRank - The PageRank (PR) of the website and of individual pages influences crawl rate. Grey PageRank pages get cached far less often than pages with visible Toolbar PageRank, and the re-caching frequency is higher for high PR pages than for low PR pages. It is common for the Google cache of Grey PageRank pages to be updated only once every three or four weeks.

Page Update Frequency - Google uses intelligent crawling technology that automatically increases Googlebot activity on pages which are updated frequently and reduces it on pages which are updated infrequently. As Google has some 80 billion web pages to index, it aims to improve the efficiency of the Googlebot crawling process by avoiding pages which it has previously learnt rarely get updated. Matt Cutts prepared an interesting video on how Google crawls sites and we'd recommend taking 5 minutes to watch it: Matt Cutts Googlebot crawl method video.

Quantity and Quality of Back-Links - As Googlebot follows links between websites on the Worldwide Web, the more inbound links a site and its internal pages have acquired, the more frequently the site and its pages will be re-cached. Even a few additional 'deep links' from other websites to the important internal pages of your domain could help increase the crawl rate and frequency of Googlebot visits, ensuring that the Google cache of your website is updated more regularly.

Server Response Codes - Googlebot checks the server response code for every request. In other words, when Googlebot tries to fetch a web page, it checks the response the hosting web server gives. Typical responses include "200 OK" - the page is present and is rendered - or "301 Moved Permanently" - the page has permanently moved to another URL. The latter results in the old page URL being removed from the Google index and the new URL being indexed (see the example below). For more help and advice on acquiring additional inbound links for your website, read our advanced link building and Google SEO strategies articles.
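
As a sketch of the 301 case above (the URLs are hypothetical), a permanent redirect can be declared in the site's .htaccess file on an Apache server, so that Googlebot receives a 301 response for the old URL and indexes the new one instead:

Redirect 301 /old-page.html http://www.mydomain.com/new-page.html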

Adjusting Google Crawling Rate in Webmaster Tools

If Googlebot is visiting your site too often or too infrequently, the crawl rate can be adjusted from Google Webmaster Tools under "Settings" > "Crawl Rate".

Access to Google Webmaster Tools requires verification of site ownership, which is accomplished either by uploading a small 'verify' file onto the web server or by adding a meta tag to the homepage HTML code.
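
For reference, the meta tag method adds a tag of roughly the following form inside the <head> of the homepage, where the content value is a unique token issued by Google (the one below is a placeholder):

<meta name="google-site-verification" content="your-unique-verification-token">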

Google Big Daddy Update

Following the "Big Daddy" Google infrastructure update in the spring of 2006, the crawling rate of websites is heavily influenced by the number and quality of backlinks the site has acquired. For this reason, it is not unusual for a website with few inbound links to experience fewer Googlebot deep crawls.

After Google's Big Daddy update, many websites developed indexing and Googlebot crawl rate problems. Since Big Daddy, Google seems to be indexing fewer web pages, particularly on recently launched domains and low quality sites.

Partial Google indexing is now common for websites with few inbound links. This frequently results in only the top hierarchy of pages being Google crawled and included in the Google index, with deeper internal pages (three or more clicks from the homepage) only being partially indexed or not indexed at all.

The Big Daddy update problems have long since been resolved, but many domains, even trusted sites, are still left with significant numbers of pages outside the main Google index. These pages would have shown up as Google Supplemental Results until the labelling of such pages was removed in the summer of 2007.

Practical implementation of X-Robots-Tag with Apache

You can add the X-Robots-Tag to a site's HTTP responses using the .htaccess and httpd.conf files that are available by default on Apache-based web servers. The benefit of using an X-Robots-Tag with HTTP responses is that you can specify indexing directives that are applied globally across a site. Support for regular expressions allows a high level of flexibility.

For example, to add a noindex, nofollow X-Robots-Tag to the HTTP response for all .PDF files across an entire site, add the following snippet to the site's root .htaccess file or httpd.conf file:
<Files ~ "\.pdf$">
Header set X-Robots-Tag "noindex, nofollow"
</Files>

You can use the X-Robots-Tag for non-HTML files, such as image files, where the usage of robots meta tags is not possible. Here's an example of adding a noindex X-Robots-Tag directive for image files (.png, .jpeg, .jpg, .gif) across an entire site:
<Files ~ "\.(png|jpe?g|gif)$">
Header set X-Robots-Tag "noindex"
</Files>

Combining crawling with indexing / serving directives

Robots meta tags and X-Robots-Tag HTTP headers are discovered when a URL is crawled. If a page is disallowed from crawling through the robots.txt file, then any information about indexing or serving directives will not be found and will therefore be ignored. If indexing or serving directives must be followed, the URLs containing those directives cannot be disallowed from crawling.
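
For example, for the PDF noindex header shown earlier to take effect, the robots.txt file must not also block the PDF URLs. A combination along these lines (the directives below are illustrative) works, because crawling is allowed and the header can therefore be discovered:

User-agent: *
Disallow:

<Files ~ "\.pdf$">
Header set X-Robots-Tag "noindex, nofollow"
</Files>

If robots.txt instead contained "Disallow: /", Googlebot would never fetch the PDF files and the noindex header would be ignored.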
