What is a robots.txt file? Best practices, rules, and functions

What is a robots.txt file?

Robots.txt is a plain text file created by webmasters. Its basic function is to define rules that instruct search engine robots (web robots) how to crawl pages on a website.

The robots.txt file is a core component of the “robots exclusion protocol” (REP), a set of web standards that governs how robots crawl websites, access and index content, and serve that content up to users. The REP also includes directives like meta robots, as well as page-, subdirectory-, and site-wide instructions for how search engines should treat links.

In practice, robots.txt files specify whether particular user agents (web-crawling software) can or cannot crawl different parts of a website. These crawl instructions are given by allowing or disallowing paths for particular user agents.

Basic format:
User-agent: *
Disallow: /
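As a quick sanity check, the effect of these two lines can be verified with Python's built-in robots.txt parser (the bot name and URL below are placeholders, not part of any real site):

```python
# Minimal sketch: parsing the basic-format rules with urllib.robotparser.
from urllib.robotparser import RobotFileParser

rules = """
User-agent: *
Disallow: /
""".splitlines()

parser = RobotFileParser()
parser.modified()  # mark the rules as freshly loaded so can_fetch() answers
parser.parse(rules)

# "Disallow: /" applied to every user agent ("*") blocks the whole site.
print(parser.can_fetch("MyBot", "https://www.example.com/any-page"))  # False
```

In other words, this basic format is the most restrictive possible file: it tells every crawler to stay out of every path on the site.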

This raises a big question: why do we need robots.txt?
Robots.txt files control crawler (robot) access to particular directories and URLs of your site. It can be very damaging if Googlebot is accidentally disallowed from crawling, so be careful when creating or editing your robots.txt file. That said, there are certain situations in which a robots.txt file can be very helpful.

Some common uses

  • Preventing duplicate content from appearing in SERPs.
  • Keeping entire sections of a website private.
  • Keeping internal search results pages from showing up in a public SERP.
  • Specifying the location of sitemap(s).
  • Blocking search engines from indexing particular files on your website, for example media files like images and PDFs.
  • Specifying a crawl delay to keep your servers from becoming overloaded when crawlers load multiple pieces of content at once.

SEO (Search Engine Optimization) best practices


  • Make sure you are not blocking any sections of your website, or any content, that you want to be crawled.
  • Keep in mind that links on pages blocked by robots.txt will not be followed or crawled. That means:
  • If a resource is linked only from blocked pages (and not from any other search-engine-accessible page), it will not be crawled and may not be indexed by search engines like Bing or Yandex.
  • No link equity can be passed from a blocked page to the resources it links to. If you have pages to which you want equity to flow, use a blocking mechanism other than robots.txt.
  • Never use robots.txt to keep sensitive data, like private user information, out of SERPs. Other pages on the web may link directly to the page containing that information, bypassing the robots.txt directives on your root domain, so it may still get indexed. If you want to block a specific page from indexing, use another method such as a noindex meta directive or password protection.
  • Search engines typically have multiple user agents. For example, Google uses Googlebot for organic search and Googlebot-Image for image search. User agents from the same search engine mostly follow the same rules, so there is usually no need to write separate directives for each of a search engine’s crawlers.
  • Search engines will cache your robots.txt contents, but usually refresh the cache at least once a day. If you change the file and want it picked up more quickly, you can submit your robots.txt URL to Google.
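For reference, the noindex meta directive mentioned above is a single tag placed in the page’s `<head>` (this is the standard form of the tag, not specific to any one site):

```html
<!-- In the <head> of the page you want excluded from search results -->
<meta name="robots" content="noindex">
```

Unlike a robots.txt Disallow, this lets crawlers fetch the page but tells them not to show it in search results.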

Some basic robots.txt must-knows

  • A robots.txt file must be placed in a website’s top-level (root) directory.
  • The file should always be named “robots.txt” (not robots.TXT or Robots.txt).
  • Some user agents (robots) may decide to ignore the robots.txt file. This is especially common with malicious crawlers like email address scrapers and malware robots.
  • The robots.txt file is publicly available: just add /robots.txt to the end of any root domain (for example, www.example.com/robots.txt) to see that website’s directives. This means anyone can see which pages you do or don’t want to be crawled.
  • Each domain and subdomain must have its own robots.txt file. That means both blog.xyz.com and xyz.com need separate robots.txt files in their root directories (for example, blog.xyz.com/robots.txt and xyz.com/robots.txt).
  • It is best practice to add the sitemap(s) linked with the domain at the bottom of the robots.txt file.
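Because the file always lives at the root of its (sub)domain, the robots.txt URL for any page can be derived mechanically. A minimal Python sketch (the domain names below are placeholders):

```python
# Sketch: derive the robots.txt URL for whatever domain serves a given page.
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the domain serving page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host, replace the path with /robots.txt.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://blog.example.com/posts/seo-tips"))
# https://blog.example.com/robots.txt
```

Note how the subdomain is preserved: blog.example.com and example.com each get their own robots.txt, matching the rule above.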


For example:

User-agent: Bingbot
Disallow: /example-subfolder/blocked-page.html
Sitemap: https://www.xyz.com/sitemap.xml
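A rough sketch of how these directives behave, again using Python's built-in parser (the URLs are the placeholder domain from the example):

```python
# Sketch: only Bingbot is named, and only one page is disallowed for it.
from urllib.robotparser import RobotFileParser

rules = """
User-agent: Bingbot
Disallow: /example-subfolder/blocked-page.html
Sitemap: https://www.xyz.com/sitemap.xml
""".splitlines()

parser = RobotFileParser()
parser.modified()  # mark the rules as freshly loaded so can_fetch() answers
parser.parse(rules)

# Bingbot is blocked from the one page but free to crawl everything else;
# agents with no matching group fall back to "allow everything".
print(parser.can_fetch("Bingbot", "https://www.xyz.com/example-subfolder/blocked-page.html"))   # False
print(parser.can_fetch("Bingbot", "https://www.xyz.com/other-page.html"))                       # True
print(parser.can_fetch("Googlebot", "https://www.xyz.com/example-subfolder/blocked-page.html")) # True
```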

How to create a robots.txt file


If your website does not have a robots.txt file, or you want to modify it, creating one is a simple process. Google’s documentation covers the robots.txt creation process in detail, and its robots.txt testing tool lets you check whether your file is set up correctly.

If your website doesn’t have a robots.txt file, you can create one very easily using Notepad or any other text editor. Copy the created robots.txt file into your website’s root directory (for example: www.xyz.com/robots.txt). You can upload the file using an FTP client such as FileZilla, or through cPanel. After uploading, set the file permission to 644.
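As a sketch, the file-creation and 644-permission steps can also be done with a few lines of Python (the filename and rules are illustrative; run it in the directory you will upload from):

```python
# Sketch: write a robots.txt locally and set 644 permissions
# before uploading it to the site's root directory.
import os
import stat

RULES = """User-agent: *
Disallow: /wp-admin/
"""

with open("robots.txt", "w") as fh:
    fh.write(RULES)

# 644 = owner read/write, group and others read-only
os.chmod("robots.txt", 0o644)

mode = stat.S_IMODE(os.stat("robots.txt").st_mode)
print(oct(mode))  # 0o644
```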
The simplest and most effective robots.txt rules that I can recommend, to the best of my limited knowledge, are given below.


Example No. 1

User-Agent: *
Allow: /wp-content/uploads/
Disallow: /wp-content/plugins/
Disallow: /wp-admin/
Disallow: /readme.html
Disallow: /refer/
Sitemap: https://www.yourdomainname.com/sitemap_index.xml

You can simply copy these lines into your robots.txt file, or use them to modify your existing one.
Note: replace www.yourdomainname.com/sitemap_index.xml with your own website’s domain name and sitemap filename.

Another robots.txt file example that I want to share with you is:

Example No. 2

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Allow: /wp-content/themes/your-theme-name/
User-agent: Mediapartners-Google*
Allow: /

User-agent: Googlebot-Image
Allow: /

User-agent: Adsbot-Google
Allow: /

User-agent: Googlebot-Mobile
Allow: /

User-agent: Twitterbot
Allow: /

User-agent: *
Disallow: /search
Disallow: /cgi-bin/
Allow: /

User-agent: *
Disallow: /*.html
Allow: /*.html$
Sitemap: https://www.yourdomainname.com/sitemap_index.xml

As you can see, separate rules are defined for each user agent, which is a little complicated for newbies. The last group, for instance, disallows any URL containing “.html” except those ending exactly in “.html” (the $ anchors the pattern to the end of the URL). So for newbies, I recommend the first example.

Thanks for Reading