It has always been a controversial topic, one that has prompted debate and outright attacks (such as the one from the Wall Street Journal in November 2019), and it therefore called for a new, perhaps definitive, clarification. Danny Sullivan, Google’s public voice, has published an article explaining how changes to Search ranking are implemented, focusing in particular on the role of Google’s quality raters in this process.

An ever-changing search system

“Every search you do on Google is one of the billions we receive that day,” begins Mountain View’s Public Liaison for Search, who recalls how “in less than half a second, our systems sort through hundreds of billions of web pages to try to find the most relevant and useful results available”.

But this system cannot be static, because “the Web and people’s information needs continue to change”, and so Google makes “many improvements to our search algorithms to keep up”, at a rate of thousands per year (about 3,200 changes in 2018 alone, for instance).

Google works to always improve results

Google’s goal, reiterated on numerous other occasions, is “to always work on new ways to make our results more useful, whether it is a new feature, or offering new ways of understanding the language in search” (as is explicitly the case with Google BERT).

These improvements are approved at the end of a precise and rigorous evaluation process, designed so that people around the world “will continue to find Google useful for everything they are looking for”. Sullivan also points out that there are several “ways in which insights and feedback from people around the world help improve Search”.

The job of Google’s research team

Generally speaking, Google works to make it easier for people to find useful information, but the sheer size of the audience also means that users have different information needs depending on their interests, the language they speak and their location in the world.

So the basic mission is to make information universally accessible and useful. Contributing to this is a dedicated Google research team whose task is to get in touch with people around the world to understand how Search can be more useful. People are invited to provide feedback on different iterations of projects, or the team itself carries out field research to understand how users in different communities access information online.

The example of Google Go: insights to meet the needs

Sullivan also provides a practical example: “over the years we have learned the unique needs and technical limitations that people in emerging markets have when accessing online information,” and this has led to the development of Google Go, “a lightweight search app that works well with less powerful phones and less reliable connections”. Google later added to the same app “extraordinarily useful features, including one that allows you to listen to web pages aloud, especially useful for people who learn a new language or who might be uncomfortable with reading long texts”, features that would not have been developed without the right insights from the people who eventually use them.

The commitment to result quality

At the same time, there is constant work on how the search engine actually operates and on the quality of the results offered to users. As the Googler says, “a key part of our evaluation process is getting feedback from everyday users about whether our ranking systems and proposed improvements are working well”.

In other words, whether the SERPs surface quality content, as explained in detail in the search quality rater guidelines (more than 160 pages long), whose gist can be summarized as: “Search is designed to return relevant results from the most reliable sources available”.

To determine certain parameters, Google’s systems automatically use “signals from the Web itself (for example, where words from your search appear on Web pages or how pages connect to each other on the Web) to understand what information is related to your query and whether it is information that people tend to trust”. However, the notions of relevance and reliability “are ultimately human judgments, so to measure whether our systems are actually understanding them correctly, we need to gather insights and guidance from people”.

Who the Search quality raters are

This is the task of search quality raters, a “group of over 10,000 people around the world” that helps Google “measure the way people are likely to come into contact with our results”. These external collaborators “provide assessments based on our guidelines and represent real users and their likely information needs, using their best judgment to represent their locale”. These people, Sullivan says, “study and are tested on our guidelines before they can start providing ratings”.

How an evaluation works

The blog article on The Keyword also describes the standard evaluation process followed by quality raters.

Google generates “a sample of queries (let’s say, a few hundred)”, which it assigns to a group of raters, who are shown two different versions of the results pages for those searches (a kind of A/B test, basically). “One set of results comes from the current version of Google and the other set comes from an improvement that we are considering”.

The raters “review each page listed in the results set and evaluate that page against the query”, referring to the guidelines mentioned above, and in particular “determine whether those pages meet the information needs based on their understanding of what that query was looking for” (i.e., whether they respond to the search intent) and “consider elements such as how authoritative and reliable that source seems to be on the subject in the query”.

The analysis of the E-A-T paradigm

To assess “parameters such as expertise, authoritativeness and trustworthiness (sometimes referred to as ‘E-A-T’), raters are asked to do reputational research on sources”, and Sullivan offers a further example to illustrate this work.

“Imagine that the query is carrot cake recipe: the result set can include articles from recipe sites, food magazines, food brands and maybe blogs. To determine whether a web page meets information needs, an evaluator may consider how easy the cooking instructions are to understand, how useful the recipe is in terms of visual instructions and images, and whether there are other useful features on the site, such as a tool to create a shopping list or an automatic calculator to adjust serving sizes”.

At the same time, “to understand if the author has experience in the subject, a rater will do some online research to see if the author has qualifications in cooking, whether they have profiles or references on other food-related websites or have produced other quality content that has received positive reviews or ratings on recipe sites”.

The underlying goal of this research is “to answer questions like: is this page trustworthy, and does it come from a site or author with a good reputation?”.

Evaluations are not used for ranking

After the evaluators have done this research, they provide a quality score for each page. At this point, Sullivan strongly stresses that “this evaluation does not directly affect the ranking of this page or site in the search”, thus reiterating that the work of quality raters carries no weight in ranking.

In addition, “no one is deciding that a certain source is authoritative or reliable” and “the pages are not assigned ratings as a way of determining how well to rank them”. It could not be otherwise, says Sullivan, because this “would be an impossible task and above all a mediocre signal to use: with hundreds of billions of pages constantly changing, there is no way that humans can evaluate every page on a recurring basis”.

Instead, each rating is “a data point that, taken in aggregate form, helps us to measure the effectiveness of our systems to provide quality content in line with the way people, around the world, evaluate information”.
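To make the idea of ratings “taken in aggregate form” more concrete, here is a minimal sketch of how side-by-side scores could be averaged to compare two versions of a results page. Everything in it is an assumption made for illustration: the 0-to-1 rating scale, the function names `aggregate_score` and `compare_versions`, and the sample numbers. Sullivan’s article does not describe Google’s actual formula.

```python
from statistics import mean

def aggregate_score(ratings_by_query):
    """Average the per-query mean rating across all sampled queries."""
    return mean(mean(scores) for scores in ratings_by_query.values())

def compare_versions(current, proposed):
    """Return the aggregate difference (proposed minus current)."""
    return aggregate_score(proposed) - aggregate_score(current)

# Hypothetical ratings on an assumed 0-1 scale: query -> scores of the rated pages
current = {
    "carrot cake recipe": [0.5, 0.75],
    "best running shoes": [0.5, 0.5],
}
proposed = {
    "carrot cake recipe": [0.75, 0.75],
    "best running shoes": [0.5, 1.0],
}

delta = compare_versions(current, proposed)
print(f"Aggregate change: {delta:+.4f}")
```

In this sketch no single page’s score decides anything: only the aggregate difference across the whole query sample says whether the proposed version looks better than the current one.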

What the evaluations are for

So what are these human interventions really for? Sullivan explains by revealing that last year alone Google “carried out over 383,605 Search quality tests and 62,937 side-by-side experiments with our search quality raters to measure the quality of our results and help us make over 3,600 improvements to our search algorithms”.

Live experiments

In addition to these two types of feedback, a further system is used to make improvements: Google must “understand how a new feature will work when it is actually available in Search and people use it as they would in real life”. To obtain this information, the company tests “the way people interact with new features through live experiments”.

These live tests are “actually available for a small portion of randomly selected people using the current version of Search” and “to test a change, we will start a function on a small percentage of all the queries we receive and examine a number of different metrics to measure the impact”.
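A common way to expose a change to only a small slice of traffic is deterministic hash-based bucketing. The sketch below shows the general technique, not Google’s implementation: the function `in_experiment`, the experiment name and the user identifiers are all hypothetical.

```python
import hashlib

def in_experiment(user_id: str, experiment: str, percent: float) -> bool:
    """Deterministically place roughly `percent`% of users in an experiment."""
    # Hashing the experiment name with the user id gives each experiment
    # its own, effectively random, slice of users.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10_000  # bucket in the range 0..9999
    return bucket < percent * 100          # e.g. 1.0% -> buckets 0..99

# Hypothetical traffic: expose about 1% of users to a new feature
users = [f"user-{i}" for i in range(100_000)]
exposed = [u for u in users if in_experiment(u, "new-feature", 1.0)]
share = len(exposed) / len(users)
print(f"Share of users in the experiment: {share:.2%}")
```

Because the assignment depends only on the hash, the same user always lands in the same group, which keeps the metrics collected for the exposed and non-exposed groups comparable over time.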

The aim is to answer questions like “did people click or tap on the new feature? Did most people ignore it? Did it slow the page load?”, which generate insights that “can help us understand a little bit if that new feature or change is useful and if people will actually use it”.

Also last year, Google “ran over 17,000 real-time traffic experiments to test new features and search improvements”. Comparing this number with the number of changes actually made (about 3,600, as mentioned before), we understand that “only the best and most useful enhancements land in Google Search”.
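As a quick back-of-the-envelope check on that comparison, the snippet below divides the two figures quoted in the article. Both are stated as “over” a given number, so the result is only indicative, and it assumes (which the article does not say) that launched changes can be compared one-to-one with live experiments.

```python
# Figures quoted in the article (both described as "over" these values)
live_experiments = 17_000
launched_changes = 3_600

# Rough share of live experiments that correspond to a launched change
launch_ratio = launched_changes / live_experiments
print(f"Roughly {launch_ratio:.0%} of live experiments led to a launched change")
```

In other words, on these numbers only about one live experiment in five ends up as a change in Search.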

The goal: ever more useful results

While admitting that “our search results will never be perfect”, Sullivan concludes that “these search and evaluation processes have proven very effective over the past two decades”, allowing Google to “make frequent improvements and ensure that the changes made represent the needs of people around the world who come looking for information”.

