It has been a little over a year since Google announced the start of work to define an official standard for the robots.txt file. In recent months, the company has released to the open source world the robots.txt parser and matcher it uses in its production systems and, as explained in an article just published on its webmasters blog, has "seen people build new tools with it, contribute to the open source library (effectively improving our production systems, thanks!) and release new language versions like golang and rust, which make it easier for developers to create new tools".

The latest news for the open source robots.txt

In addition to summarizing this work, the post highlights two new robots.txt releases that "were made possible by two interns working in the Search Open Sourcing team, Andreea Dutulescu and Ian Dolzhanskii", as a thank-you marking the end of their internship season at Google.

The first is the mind behind the robots.txt specification test framework currently being released, while the second worked on the Java robots.txt parser and matcher.

What is the Robots.txt Specification Test

As mentioned, Andreea Dutulescu created the testing framework for robots.txt parser developers, currently being released: a tool that can verify whether, and to what extent, a robots.txt parser follows the Robots Exclusion Protocol. Currently, the article reads, "there is no official and complete way to assess the correctness of a parser, so Andreea has developed a tool that can be used to create robots.txt parsers that follow the protocol".
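To give an idea of what such a compliance check involves, here is a minimal sketch in Java. This is not Google's testing framework: the class and method names are hypothetical, and the "matcher" under test is a deliberately simplistic prefix-only stand-in. The point is the structure, a table of (robots.txt body, path, expected verdict) cases run against a candidate parser.

```java
import java.util.List;

// Hypothetical sketch of a robots.txt compliance suite: a list of
// (robots.txt body, URL path, expected verdict) cases checked against
// a candidate matcher. The toy matcher below handles only plain
// Disallow prefixes inside a "User-agent: *" group.
public class ComplianceSketch {
    record Case(String robots, String path, boolean expectedAllowed) {}

    // Toy matcher: allows a path unless a Disallow prefix in a "*" group matches.
    static boolean toyIsAllowed(String robots, String path) {
        boolean inGroup = false;
        for (String raw : robots.split("\n")) {
            String line = raw.trim();
            String lower = line.toLowerCase();
            if (lower.startsWith("user-agent:")) {
                inGroup = line.substring(11).trim().equals("*");
            } else if (inGroup && lower.startsWith("disallow:")) {
                String prefix = line.substring(9).trim();
                // An empty Disallow value means "allow everything".
                if (!prefix.isEmpty() && path.startsWith(prefix)) return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Case> cases = List.of(
            new Case("User-agent: *\nDisallow: /tmp/", "/tmp/a.html", false),
            new Case("User-agent: *\nDisallow: /tmp/", "/index.html", true),
            new Case("User-agent: *\nDisallow:", "/anything", true)
        );
        int passed = 0;
        for (Case c : cases) {
            if (toyIsAllowed(c.robots(), c.path()) == c.expectedAllowed()) passed++;
        }
        System.out.println(passed + "/" + cases.size() + " cases passed");
    }
}
```

A real compliance suite would cover many more corner cases (wildcards, longest-match precedence, case handling, malformed lines), which is exactly the ground an official test framework standardizes.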

What is the Java robots.txt parser and matcher

The second piece of news is the official release of a Java port of the C++ robots.txt parser, created by Ian. Java is the third most popular programming language on GitHub and is widely used at Google, so it is no surprise that it was the most requested port.

The parser is a 1-to-1 translation of the C++ parser in terms of functions and behavior, and has been carefully tested for parity against a large collection of robots.txt rules. Teams are already planning to use the Java robots.txt parser in Google's production systems, they write from Mountain View.

The parser and matcher library

On the GitHub page of this tool you can read some more information, starting with a quick aside on the Robots Exclusion Protocol (REP), defined as the "standard that enables website owners to control which URLs may be accessed by automated clients (such as crawlers) through a simple text file with a specific syntax".
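For reference, a minimal robots.txt file using that syntax might look like the following (the paths and sitemap URL here are purely illustrative):

```
# Rules for all crawlers
User-agent: *
Disallow: /private/
Allow: /private/public.html

# Optional: point crawlers at the sitemap
Sitemap: https://example.com/sitemap.xml
```

Each group starts with one or more User-agent lines, followed by Allow/Disallow rules that apply to the matched crawlers.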

This protocol is one of the basic building blocks of the Internet as we know it and is what allows search engines to operate, but, the page reiterates, "it has only been a de facto standard for the past 25 years": as a result, different implementers handle robots.txt parsing in slightly different ways, generating confusion.

The project in question aims to resolve this chaos by releasing the parser Google uses: the library is slightly modified (for example, in some internal headers and equivalent symbols) compared to the production code Googlebot uses to determine which URLs it can access based on the rules webmasters provide in robots.txt files. The library is released open source to help developers build tools that better reflect Google's robots.txt parsing and matching.
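To illustrate the kind of logic such a matcher implements, here is a minimal, self-contained sketch in Java. It is not Google's library: the class and method names are hypothetical, it supports only a subset of the REP (no wildcards, no `$` anchors), and it applies the precedence Google documents for its own parser: the most specific (longest) matching rule wins, with Allow winning ties.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal illustrative robots.txt parser and matcher sketch.
// NOT Google's library; names are hypothetical and only plain
// prefix rules in a single matching user-agent group are handled.
public class TinyRobotsMatcher {
    private final List<String[]> rules = new ArrayList<>(); // {directive, path prefix}

    public TinyRobotsMatcher(String robotsTxt, String userAgent) {
        boolean inGroup = false;
        for (String raw : robotsTxt.split("\n")) {
            String line = raw.split("#", 2)[0].trim(); // strip comments
            int colon = line.indexOf(':');
            if (colon < 0) continue;
            String key = line.substring(0, colon).trim().toLowerCase();
            String value = line.substring(colon + 1).trim();
            if (key.equals("user-agent")) {
                inGroup = value.equals("*") || value.equalsIgnoreCase(userAgent);
            } else if (inGroup && (key.equals("allow") || key.equals("disallow"))) {
                rules.add(new String[] {key, value});
            }
        }
    }

    // Most specific (longest) matching rule wins; on a tie, Allow wins.
    public boolean isAllowed(String path) {
        int bestLen = -1;
        boolean allowed = true; // no matching rule means the URL is allowed
        for (String[] rule : rules) {
            String prefix = rule[1];
            if (prefix.isEmpty() || !path.startsWith(prefix)) continue;
            boolean isAllow = rule[0].equals("allow");
            if (prefix.length() > bestLen || (prefix.length() == bestLen && isAllow)) {
                bestLen = prefix.length();
                allowed = isAllow;
            }
        }
        return allowed;
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\nDisallow: /private/\nAllow: /private/public.html\n";
        TinyRobotsMatcher m = new TinyRobotsMatcher(robots, "FooBot");
        System.out.println(m.isAllowed("/private/secret.html")); // false
        System.out.println(m.isAllowed("/private/public.html")); // true
        System.out.println(m.isAllowed("/index.html"));          // true
    }
}
```

The production parser handles far more (wildcards, typo-tolerant directives, encoding quirks), which is precisely why having the real implementation open-sourced matters.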

