Sarchy : A New Hope in Search

A new implementation of the not-so-young, open source search engine YaCy

Thursday 28 March 2019, by Emmanuel Barthe // Software, Internet, search engines

Sarchy (URL : sarchy.tech) is an intriguing faceted search engine (with RSS saved searches) based on the open source YaCy search engine and developed by Agnel Vishal (Twitter @agnelvishal), a developer from Chennai in the Tamil Nadu region of India. Sarchy was spotted by one of the best French monitoring specialists, Christophe Deschamps (TW @crid ; blog Outils Froids), and relayed by Serge Courrier (TW @secou) of RSS Circus, another French monitoring specialist.

Sarchy is based on the (rather old) YaCy open source search engine

Sarchy is not really a newcomer. It is based on the open source search engine YaCy, which is already 8 years old. YaCy is a distributed peer-to-peer search engine written by a team of German developers. The source code is hosted on GitHub. According to its web site, « you don’t need to install external databases or a web server, everything is already included ».

To be honest, I tested a YaCy implementation some years ago and I wasn’t impressed at the time. And Sarchy’s performance, especially the breadth of its index (for instance, it indexes the lemonde.fr domain quite slowly and poorly), doesn’t make it competitive in any way with Google or Bing. Nevertheless, *this* implementation of YaCy is very interesting.

According to Agnel Vishal :

  • Sarchy is a fork of YaCy. YaCy does not use a PageRank algorithm but Sarchy uses one. Also, Vishal says he uses social media statistics as a ranking parameter
  • Sarchy’s index is part of the YaCy P2P network, but at the same time, Sarchy makes YaCy’s index accessible as a webapp [1]
  • the total number of web pages in YaCy’s index is around 1.7 billion. Sarchy launched a week ago and has 2.43 million web pages
  • he plans to increase the crawl speed 30-fold within 2 to 3 weeks
  • he got 3000 USD in Google Cloud credits thanks to YC Startup School. He hopes to get revenue from advertisements and donations before the cloud credits run out. Let’s hope he will be able to obtain that or other financing in the near future.
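
As a rough illustration of what a PageRank-style score involves, here is a minimal textbook sketch in Python. It is not Sarchy’s code : Vishal does not describe his implementation, and the toy link graph, the damping factor of 0.85 and the function name are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank (illustrative, not Sarchy's code).

    links maps each page to the list of pages it links to.
    """
    # Collect every page that appears as a source or as a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start from a uniform distribution.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web, invented for the example.
toy_web = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
    "d.example": ["c.example"],
}
ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

In this toy graph, the page that receives the most inbound links (c.example) ends up with the highest score, which is the intuition behind link-based ranking. How Sarchy combines such a score with social media statistics is not documented in what Vishal told me.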

As Serge Courrier points out, one can integrate RSS feeds. Also, there is a desktop version of YaCy.

And, as argued by YaCy’s lead developer and the Free Software Foundation Europe (FSFE), which supported the YaCy project, this peer-to-peer search engine doesn’t monitor your searches and doesn’t do targeted advertising [2].

Relevancy still an issue

I have just tested Sarchy with my favorite test query, which is oriented towards French law, and with some others.

The content indexed, though limited compared to competitors, is of good quality in my experience. But in the legal field, at the very least, relevancy on Sarchy remains an issue. Sarchy, contrary to Google, does not seem able to guess a query’s context, nor even to recognize synonyms of the query words (in other words, Sarchy doesn’t do the machine learning version of natural language processing).

I reckon that, for the time being, relevancy is hampered by the lack of indexed content. In the legal field, I would suggest better, relevancy-oriented indexing of official, government and public institution web sites (they have good quality content, even though it is free, and Sarchy already indexes them or at least knows their domains).

Agnel Vishal answered my remark : as soon as one searches for a page/site, the crawler automatically starts crawling related pages. To me, that’s a very good idea : it keeps the engine from indexing unnecessary pages. But at the same time, there is an associated spamdexing risk. In turn, YaCy’s Twitter account explained that, to protect against spam indexes, YaCy does link reloading to verify that the presented link actually contains the searched words.
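
The idea behind that verification step can be sketched in a few lines of Python. This is only my illustration of the principle, not YaCy’s actual code : the function names are hypothetical, the page texts are supplied by hand instead of being fetched, and I assume the simplest possible rule (every query word must appear in the reloaded page).

```python
import re


def tokenize(text):
    """Lower-case the text and return its set of words."""
    return set(re.findall(r"\w+", text.lower()))


def page_still_matches(page_text, query):
    """Return True if every word of the query appears in the page text."""
    query_words = tokenize(query)
    return bool(query_words) and query_words <= tokenize(page_text)


def filter_verified(results, query):
    """Keep only the URLs whose freshly reloaded text contains the query words.

    results is a list of (url, reloaded_text) pairs.
    """
    return [url for url, text in results if page_still_matches(text, query)]


# Hypothetical index entries with the text obtained when reloading each link.
candidates = [
    ("https://example.org/real", "Le ministère de la Justice publie un rapport."),
    ("https://example.org/spam", "Buy cheap watches now."),
]
verified = filter_verified(candidates, "ministère justice")
print(verified)
```

A page that was indexed under misleading keywords but no longer (or never) contained them is dropped before the results are shown. The price is one extra check per result, which is consistent with the response times I observed.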

Of course, link reloading, content checking and a distributed architecture mean that response time is somewhat slow (4-5 seconds on an enterprise Internet connection). But I didn’t find it that annoying.

According to Vishal, in order to get faster results, the whole database is not scanned the first time a given search is done. One should try the same query 30 seconds later and may see more webpages.

Also, since relevancy is still somewhat limited (according to my tests), it would be very useful to explain clearly somewhere on the home page what Sarchy’s operators are. The simple use of quotes (" ") on Sarchy is a big bonus to relevancy.

Judging from YaCy’s presentation of its self-hosted engine, it can be used as an alternative to Google CSE.

Search operators and filters

As in Google, one can use site:justice.gouv.fr to get results from that domain. For example : https://sarchy.tech/yacysearch.html?query=site%3Ajustice.gouv.fr&Enter=&contentdom=all&strictContentDom=false&former=justice.gouv.fr+site%3Ajustice.gouv.fr&maximumRecords=10&startRecord=0&verify=ifexist&resource=global&nav=all&prefermaskfilter=&depth=0&constraint=&meanCount=0&timezoneOffset=-330
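
For readers who want to generate such URLs programmatically (for monitoring, for instance), here is a small Python sketch. The endpoint and the parameter names (query, maximumRecords, startRecord, verify, resource) are copied from the example URL above ; the helper name is hypothetical, and I am assuming, without having checked, that this subset of parameters is enough for the engine to answer.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Endpoint taken from the example URL in the article.
SEARCH_ENDPOINT = "https://sarchy.tech/yacysearch.html"


def build_search_url(query, site=None, max_records=10, start_record=0):
    """Build a search URL, optionally restricted to one domain with site:."""
    if site:
        query = f"{query} site:{site}".strip()
    params = {
        "query": query,
        "maximumRecords": max_records,
        "startRecord": start_record,
        # Values copied from the example URL; their exact effect is assumed.
        "verify": "ifexist",
        "resource": "global",
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


url = build_search_url("", site="justice.gouv.fr")
print(url)

# Decoding the query string shows the operator as typed in the search box.
print(parse_qs(urlsplit(url).query)["query"])
```

Note that urlencode takes care of escaping the colon of the site: operator as %3A, exactly as in the URL produced by Sarchy’s own search form.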

Good to know : YaCy search operators are detailed on its wiki.

One of the main advantages of Sarchy over YaCy’s own portal is its facets (left column in the results page) : domains, year, language ... These suggestions on how to refine your search are practical and relevant. Also, Sarchy works, whereas YaCy Search does not, right now.

Vishal says a list of search operators will be added to Sarchy’s home page within 24 hours. It will cover location, date, distance between words, etc.

What’s funny is that less than two weeks after Sarchy was spotted by Christophe Deschamps, Ahrefs [3] CEO Dmitry Gerasimenko tweeted that he wants to build a new search engine in collaboration with publishers and other online content makers ... [4] Although most SEOs who answered his thread are skeptical, with the growing success of DuckDuckGo and, in France and Germany, of Qwant, it could be the sign of something serious. The business model he proposes, at least, makes sense.

Emmanuel Barthe
French law librarian and researcher, monitoring/CI specialist
search engine enthusiast (ex-Google de facto evangelist, ca. 1997, still a Google specialist for law research)

More info about YaCy and Sarchy’s implementation

Footnotes

[1] According to Fabrica, INRIA’s blog, YaCy’s main weak point is that, being a distributed search engine, it « requires many users to achieve efficient indexing » (Yacy, the peer-to-peer search engine, Fabrica, 19 January 2016). Over the course of eight years, YaCy was very slowly installed on a few more computers than the 600 it had in 2011, but not enough. Ideally everyone should install YaCy on their own computer. But very few people are willing to make the effort.

[2] Free Software Activists to Take on Google With New Free Search Engine, by Jennifer Baker, PCWorld.com, 29 November 2011. YaCy : It’s About Freedom, Not Beating Google, by Katherine Noyes, PCWorld.com, 2 December 2011.

[3] Ahrefs is a well-known, reputable SEO tool publisher. Other major actors in this field include : Moz Pro, SEMrush and Majestic.

[4] Ahrefs To Compete With Google Search & Share The Wealth With Publishers, by Barry Schwartz, SEO Roundtable, 28/03/2019.
