How Search Engines Read and Index Web Pages
This article covers the very first step toward making your website visible to search engines: understanding how search engines work. You should understand this before doing any SEO (search engine optimization) work, and ideally even before the website is built.
About search engines:
Search engines use web robots (also called web crawlers or web spiders), programs that traverse websites automatically, to index web content into their databases. A search engine's job is to give people information as close as possible to what they searched for (their search terms, or keywords), whatever form that information takes: text, image, or video.
Thus, your site ranking (without paid search) depends largely on:
- The text content – how closely your content matches the search terms people enter. Text content includes your domain name, your page title, your body content, and so on.
- The site design and coding – whether your website is coded in a way that lets search engines easily crawl the content throughout the entire site and easily determine what your site is about in order to index it.
- How useful your content is – shown by the amount of traffic and repeat traffic to your site, links to your website from other sites, and links from your site to other sites.
- How fresh your content is and how regularly it is updated over time.
- How old your site is – from a search engine's standpoint, a new domain is certainly favored less than an already established one.
How search engines crawl your website:
1. Crawling the robots.txt file: search engine crawlers first look through your robots.txt file, if there is one, in the root folder to receive instructions on which files and folders should be excluded from the crawl. By default, web spiders crawl your entire website. If there are any files or folders that you don't want crawled and indexed, they should be clearly specified in the robots.txt file.
The robots.txt file is a simple text file, coded as shown below. See the post about the robots.txt file for more detailed information. Here's a simple sample of robots.txt containing instructions to disallow all search engines from crawling and indexing three folders: cgi-bin, tmp, and blindsites.
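A minimal robots.txt matching that description might look like this (the folder names come from the text above; `User-agent: *` means the rules apply to all crawlers):

```
# Apply these rules to every crawler
User-agent: *
# Block crawling of these three folders
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /blindsites/
```

Each Disallow line names one path that compliant crawlers should skip; everything else on the site remains crawlable.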
2. Crawling the head area: when a search engine finds a page of your website, it first looks at the head area of the page, meaning all of the content inside the <HEAD> </HEAD> tags. The HEAD area is crawled first simply because it sits at the top of your page, and it contains the basic meta information about your page and site: the page title; the meta keyword and meta description tags; and the robots meta tag, which is sometimes used to add instructions or even override the instructions written in the robots.txt file mentioned above. Below is an example of the HEAD area of a web page:
<title>Web page title</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<meta name="description" content="Your web page/ website description" />
<meta name="keywords" content="Your web page/ website keywords separated by commas" />
<meta name="robots" content="noarchive" />
3. Crawling your website body area: right after the HEAD section of your web page comes the body content, encapsulated inside the <BODY> </BODY> tags. This is the main content of your web page. If the page title in the HEAD section states the main subject – what your web page is mainly about – the BODY section usually contains the information that supports that main subject.
Spiders also follow links in the body content of a page to reach other pages within your website or on external sites. Links can come from the main navigation or from text and image links found in the body content.
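To illustrate, here are the kinds of body links a spider can follow (the URLs and filenames are hypothetical examples, not from the original article):

```html
<!-- Navigation link to another page on the same site -->
<a href="/about.html">About us</a>

<!-- Text link pointing to an external site -->
<a href="https://www.example.com/">An external site</a>

<!-- Image link: the crawler follows the href and can read the alt text -->
<a href="/products.html"><img src="products-thumb.png" alt="Our products" /></a>
```

In every case the crawler extracts the URL from the href attribute; for image links, descriptive alt text gives it extra context about the destination page.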
Above is a brief description of how search engines work. Next, you may want to learn how to use this information to make your site more visible and more search-engine-friendly.
Related post: Ways to increase website traffic & ranking.