Earlier this month, Yandex turned up results from Google Docs that clearly were not meant to be visible to a broader public.
On the night of 4th-5th July, Russia’s premier search engine Yandex showed that it is capable of returning shared documents in its search results. Searching for passwords, for example, turned up sheets filled with password information that no one would share openly on a website. The social media uptake was quick: it did not take long before documents were found, such as an internal note (apparently) from the human resources department of a bank stating that gay people and persons with non-Slavic names should not be considered for a job. Within a day the Russian media regulator Roskomnadzor responded with an official request for information on the data leak.
From my experience as a data and subject librarian I know that many researchers and teachers use Google Docs or Dropbox to share research proposals and sensitive data such as exam results. Google’s tools in particular are easy to use, but they can also leak data. For example, did you ever expect your privately shared links to Google Docs to turn up in a search engine? No? Think again…
Yandex, which is bigger than Google in Russia, claims it only indexes documents that are directly available through a hyperlink without a password. Besides, if the administrator of a website states in a so-called robots.txt file that a link should not be indexed (Google does this through its rules), Yandex does not index it. This is a standard agreement within the search engine community. In this case, Google does allow indexing of openly available documents. But it is not a good idea to rely on the robots.txt standard alone, as it is only an agreement and does not technically prevent search engines (or people with criminal intentions) from crawling.
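To make the robots.txt agreement concrete, here is a minimal sketch of how a well-behaved crawler might check a link before fetching it, using Python's standard urllib.robotparser module. The rules, the crawler name "ExampleBot" and the example.com addresses are made up for illustration; they are not the rules Google or Yandex actually use.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
# It is NOT the file that Google actually publishes.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the lines of the file, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for link in ("https://example.com/private/notes",
                 "https://example.com/public/notes"):
        print(link, is_crawl_allowed(EXAMPLE_ROBOTS_TXT, "ExampleBot", link))
```

Note that the check happens entirely on the crawler's side. A crawler that skips this step can still request the page, which is exactly why robots.txt is an agreement and not a lock.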
Even if a search engine is allowed to index these files, they are hard to find unless a link to them is posted somewhere. If you post a link on your website, it will be indexed without a problem. Otherwise the hyperlink itself is made up of random characters. In theory it is possible to randomly check many links and try to crawl them, but Google is sure to notice such behaviour. If you think you are safe as long as you do not post the link: this is only security through obscurity. Once the secret has been guessed or revealed, anyone can access the document.
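A rough back-of-the-envelope calculation shows why guessing such links at random is impractical, and why the real risk is the link being revealed. The numbers below are assumptions chosen for illustration only: the length of the random part of the link, the size of its alphabet and the number of existing documents are not figures published by Google.

```python
import math


def id_entropy_bits(alphabet_size: int, id_length: int) -> float:
    """Bits of randomness in an identifier of id_length symbols."""
    return id_length * math.log2(alphabet_size)


def expected_guesses(alphabet_size: int, id_length: int, live_documents: int) -> float:
    """Average number of random guesses needed to hit any one existing document."""
    total_ids = alphabet_size ** id_length
    return total_ids / live_documents


if __name__ == "__main__":
    # Assumed values, for illustration only.
    ALPHABET_SIZE = 64       # e.g. letters, digits and two extra symbols
    ID_LENGTH = 44           # assumed length of the random part of a link
    LIVE_DOCUMENTS = 10**12  # assumed number of shared documents

    print("bits of randomness:", id_entropy_bits(ALPHABET_SIZE, ID_LENGTH))
    print("expected guesses:", expected_guesses(ALPHABET_SIZE, ID_LENGTH, LIVE_DOCUMENTS))
```

Under these assumptions the number of guesses is astronomically large, so random probing is not how such documents end up in an index. The protection disappears completely, however, the moment the link itself is seen by someone else.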
Obscurity is a key word in describing the way search engines create their index and ranking as well. Just like Google, Yandex has its own mail service (Yandex Mail, like Gmail) and browser (Yandex Browser, like Chrome). It is speculated that these might have provided the necessary obscured URLs, as Yandex Browser has been known to send browser histories back to the company.
As Google itself states, there are ways to prevent your data from leaking in case of an overeager search engine, for example by setting a password on a shared document or by sharing only by inviting other Google users. (To find the settings, check under ‘Share’ -> ‘Advanced’ -> ‘Who has access’ -> ‘Change’ what is allowed for the particular document.)
However, seeing what companies do with our information, it might be a better idea to stop putting our sensitive data in their hands in the first place. For academics there is a way to share data in a more secure environment through SurfDrive, a service set up and managed by a trusted (non-commercial) party. It complies with Dutch and European privacy legislation and does not provide entrusted information to third parties. Here again, it is essential that we share by invitation only!
Note: even with a service such as SurfDrive, please do not use it to share data with strict privacy requirements, such as patient data.
Illustrations with thanks to https://digitalbevaring.dk/