Often when going through the web error logs of clients, I find that they are missing a robots.txt file. Most people never see it, and it’s one of those things that falls through the cracks in the excitement of building a website.
There are hundreds of ways to build a robots.txt file, but for the most part a simple text editor and some patience are all you need to create one quickly and easily.
Once you have the list of files and folders you want to keep crawlers out of, it’s time to build your robots.txt.
A little syntax lesson. There are a few directives that you can use in a robots.txt file:
# starts a comment
User-agent: names the crawler that the following rules apply to. User-agent: * applies to EVERY crawler.
Disallow: gives a path that crawlers should not fetch; anything starting with that path is covered. Disallow: / blocks EVERYTHING, and an empty Disallow: blocks nothing.
Sitemap: gives the full URL of your sitemaps.org formatted site map.
Ok now an example…
User-agent: *
# disallow all files in these directories
Disallow: /cgi-bin/
Disallow: /folder/file.html
Sitemap: http://www.thehelpcenters.com/sitemap.xml.gz
So now to explain it line by line:
- EVERY crawler must follow the rules below
- Comment (ignored by crawlers)
- Don’t allow crawlers to crawl anything under /cgi-bin/
- Don’t allow crawlers to crawl /folder/file.html
- Use this file as your site map.
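If you want to check that a file like this does what the explanation says, Python’s standard library includes a robots.txt parser. The snippet below is only a rough sketch: the crawler name ExampleBot and the test paths are made up for illustration, and the rules are the ones from the example above.

```python
from urllib.robotparser import RobotFileParser

# The example robots.txt from this post, pasted in as a string.
ROBOTS_TXT = """\
User-agent: *
# disallow all files in these directories
Disallow: /cgi-bin/
Disallow: /folder/file.html
Sitemap: http://www.thehelpcenters.com/sitemap.xml.gz
"""

SITE = "http://www.thehelpcenters.com"


def build_parser(text):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "ExampleBot" is a hypothetical crawler name; the paths are sample URLs.
for path in ("/cgi-bin/search.cgi", "/folder/file.html", "/index.html"):
    allowed = parser.can_fetch("ExampleBot", SITE + path)
    print(path, "allowed" if allowed else "blocked")
```

The first two paths should come back blocked and the last one allowed. Keep in mind that this only tells you how Python’s parser reads the file; real crawlers may interpret edge cases a little differently.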
You can exclude folders or files from crawlers, and even specify which crawlers each rule applies to. Most often you don’t need a very complex robots.txt, and the time you spend on it will reduce bandwidth use, cut down on duplicate or incorrect content in the search engines, and help guide search engines on what to include from your site. A few more examples:
For a FrontPage website
User-agent: *
Disallow: /_private/
Disallow: /_borders/
Disallow: /_derived/
Disallow: /_fpclass/
Disallow: /_overlay/
Disallow: /_themes/
Disallow: /_vti_bin/
Disallow: /_vti_cnf/
Disallow: /_vti_log/
Disallow: /_vti_pvt/
Disallow: /_vti_txt/
For a WordPress website
User-agent: *
# disallow all files in these directories
Disallow: /cgi-bin/
Disallow: /stats/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /contact/
Disallow: /tag/
Disallow: /wp-content/
Sitemap: http://www.thehelpcenters.com/sitemap.xml.gz
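Long lists like these are easy to mistype by hand, so if you already keep the folder list somewhere you could generate the file instead. Here is a minimal sketch in Python; build_robots_txt is a helper name I made up for this post, not part of any library, and it only handles the simple one-group layout used in the examples above.

```python
def build_robots_txt(disallow, sitemap=None, user_agent="*"):
    """Return robots.txt text for one User-agent group.

    disallow   -- list of paths to block, each starting with "/"
    sitemap    -- optional full URL of the site map
    user_agent -- crawler name the rules apply to ("*" means every crawler)
    """
    for path in disallow:
        if not path.startswith("/"):
            raise ValueError("Disallow paths must start with '/': " + repr(path))

    lines = ["User-agent: " + user_agent]
    lines += ["Disallow: " + path for path in disallow]
    if sitemap:
        lines.append("Sitemap: " + sitemap)
    return "\n".join(lines) + "\n"


# Example: rebuild part of the WordPress list from this post.
wordpress_paths = ["/cgi-bin/", "/stats/", "/wp-admin/"]
print(build_robots_txt(
    wordpress_paths,
    sitemap="http://www.thehelpcenters.com/sitemap.xml.gz",
))
```

Save the output as robots.txt in the root folder of your site, since that is the only place crawlers look for it.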
If you want to dig a bit deeper, take a look at http://www.robotstxt.org/wc/norobots.html for more information.