Sarchy: A New Hope in Search
A new implementation of the not-so-young, open source search engine YaCy
Sarchy [update as of 5 Aug. 2024: the sarchy.tech domain is for sale, all links to Sarchy in this post are now dead] is an intriguing faceted search engine (with RSS saved searches) based on the open source YaCy search engine and developed by Agnel Vishal (Twitter @agnelvishal), a developer from Chennai, in the Indian state of Tamil Nadu. Sarchy was spotted by one of the best French monitoring specialists, Christophe Deschamps (Twitter @crid; blog Outils Froids), and relayed by Serge Courrier (Twitter @secou) of RSS Circus, another French monitoring specialist.
Sarchy is based on the (rather old) YaCy open source search engine
Sarchy is not really a newcomer. It is based on the open source search engine YaCy, which is already 8 years old. YaCy is a distributed peer-to-peer search engine written by a team of German developers. The source code is hosted on GitHub. According to its web site, « you don’t need to install external databases or a web server, everything is already included ».
To be honest, I tested a YaCy implementation some years ago and I wasn’t impressed at the time. And Sarchy’s performance, especially the breadth of its index (for instance, it indexes the lemonde.fr domain quite slowly and poorly), doesn’t make it competitive in any way with Google or Bing. Nevertheless, *this* implementation of YaCy is very interesting.
According to Agnel Vishal:
- Sarchy is a fork of YaCy. YaCy does not use a PageRank algorithm, but Sarchy does. Also, Vishal says he uses social media statistics as a ranking parameter
- Sarchy’s index is part of the YaCy P2P network, but at the same time, Sarchy makes YaCy’s index accessible as a webapp [1]
- the total number of web pages in YaCy’s index is around 1.7 billion. Sarchy launched a week ago and has 2.43 million web pages
- he plans to increase the crawl speed 30-fold within 2 to 3 weeks
- he got USD 3,000 in Google Cloud credits thanks to YC Startup School. He hopes to earn revenue from advertisements and donations before the cloud credits run out. Let’s hope he will be able to obtain that or other financing in the near future.
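To make the PageRank point in the list above more concrete, here is a minimal sketch of the generic, textbook PageRank computation by power iteration. It is only an illustration: Sarchy’s actual ranking code is not described in this post, and the function name, the damping factor of 0.85 and the tiny link graph are all assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration (illustrative, not Sarchy's code).

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    if not pages:
        return {}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread over all pages.
        dangling = sum(rank[page] for page in pages if not links.get(page))
        new_rank = {
            page: (1.0 - damping) / n + damping * dangling / n for page in pages
        }
        # Every page passes an equal share of its rank to each page it links to.
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "d" links to "a" but nothing links to "d".
example_links = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
ranks = pagerank(example_links)
```

In this toy graph, page "a" ends up with a higher score than page "d", because "a" receives links while "d" receives none. A real engine would combine such a link-based score with other signals, such as the social media statistics Vishal mentions.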
As Serge Courrier points out, one can integrate RSS feeds. Also, there is a desktop version of YaCy.
And, as argued by YaCy’s lead developer and the Free Software Foundation Europe (FSFE), which supported the YaCy project, this peer-to-peer search engine doesn’t monitor your searches and doesn’t do targeted advertising [2].
Relevancy still an issue
I have just tested Sarchy with my favorite French-law-oriented test query, and some others.
The content indexed (limited compared to competitors) is of good quality in my experience. But in the legal field, at the very least, relevancy on Sarchy remains an issue. Sarchy, contrary to Google, does not seem able to guess a query’s context, nor even to know the synonyms of the query words (in other words, Sarchy doesn’t do the machine learning version of natural language processing).
I reckon that, for the time being, relevancy is hampered by the lack of indexed content. In the legal field, I would suggest better, relevancy-oriented indexing of official, government and public institution web sites (their content is free yet of good quality, and Sarchy already indexes them or at least knows their domains).
Agnel Vishal answered my remark: as soon as one searches for a page or site, the crawler automatically starts crawling related pages. To me, that’s a very good idea: it keeps the engine from indexing unnecessary pages. But at the same time, there is an associated spamdexing risk. In turn, YaCy’s Twitter account explained that, to protect against spam indexing, YaCy does link reloading to verify that the presented link actually contains the searched words.
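The link reloading idea described above can be sketched in a few lines. This is an assumption-laden illustration of the principle, not YaCy’s actual code (YaCy is written in Java): the function names are hypothetical, and `fetch` stands for any callable that returns the current text of a URL, or None if the page cannot be loaded.

```python
import re


def contains_all_terms(page_text, query):
    """Return True if every word of the query occurs in the page text."""
    words = set(re.findall(r"\w+", page_text.lower()))
    terms = re.findall(r"\w+", query.lower())
    return bool(terms) and all(term in words for term in terms)


def filter_verified(urls, query, fetch):
    """Keep only the URLs whose freshly reloaded text still matches the query.

    `fetch` is a hypothetical callable: url -> page text, or None on failure.
    """
    verified = []
    for url in urls:
        text = fetch(url)
        if text is not None and contains_all_terms(text, query):
            verified.append(url)
    return verified


# Usage with a fake, in-memory "web" instead of real HTTP requests.
fake_web = {
    "http://a.example/": "Code civil, article 1240",
    "http://b.example/": "Buy cheap watches",
}
kept = filter_verified(list(fake_web), "code civil", fake_web.get)
```

Here only the first page survives the check, since the second one no longer contains the searched words. Reloading every result is also what makes this approach slower than serving results straight from the index.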
Of course, link reloading, content checking and a distributed architecture mean that response time is somewhat slow (4-5 seconds on an enterprise Internet connection). But I didn’t find it that annoying.
According to Vishal, in order to get faster results, the whole database is not scanned the first time a given search is done. One should try the same query 30 seconds later and may see more webpages.
Also, since relevancy is still somewhat limited (according to my tests), it would be very useful to explain clearly somewhere on the home page what Sarchy’s operators are. The simple use of quotes (" ") on Sarchy is a big bonus to relevancy.
Judging from YaCy’s presentation of its self-hosted engine, it can be used as an alternative to Google CSE.
Search operators and filters
As in Google, one can use site:justice.gouv.fr to get results from that domain. For example: https://sarchy.tech/yacysearch.html?query=site%3Ajustice.gouv.fr&Enter=&contentdom=all&strictContentDom=false&former=justice.gouv.fr+site%3Ajustice.gouv.fr&maximumRecords=10&startRecord=0&verify=ifexist&resource=global&nav=all&prefermaskfilter=&depth=0&constraint=&meanCount=0&timezoneOffset=-330
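For readers who want to script such searches, here is a minimal sketch that builds the same kind of URL. The path and the parameter names (query, contentdom, maximumRecords, startRecord, verify, resource, nav) are copied from the example URL above. The function name and the base address https://search.example.org are placeholders of my own, to be replaced with the address of a working YaCy-based instance.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_search_url(base, terms="", site=None, maximum_records=10, start_record=0):
    """Build a yacysearch.html URL, optionally restricted with the site: operator."""
    query = terms.strip()
    if site:
        # Append the site: operator, as in the justice.gouv.fr example.
        query = f"{query} site:{site}".strip()
    params = {
        "query": query,
        "contentdom": "all",
        "maximumRecords": maximum_records,
        "startRecord": start_record,
        "verify": "ifexist",
        "resource": "global",
        "nav": "all",
    }
    # urlencode percent-encodes the colon of site: and turns spaces into "+".
    return f"{base.rstrip('/')}/yacysearch.html?{urlencode(params)}"


# Placeholder base address (assumption), same query as the example above.
url = build_search_url("https://search.example.org", site="justice.gouv.fr")
decoded = parse_qs(urlsplit(url).query)
```

The resulting URL contains query=site%3Ajustice.gouv.fr, just like the link above. Adding search words is a matter of passing them as `terms`, for instance `build_search_url(base, "code civil", site="justice.gouv.fr")`.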
Good to know: YaCy search operators are detailed on its wiki.
One of the main advantages of Sarchy over YaCy’s own portal is its facets (left column in the results page): domains, year, language ... These suggestions on how to refine your search are practical and relevant. Also, Sarchy works, while YaCy Search does not right now.
Vishal says a list of search operators will be added to Sarchy’s home page within 24 hours. It will include location, date, distance between words, etc.
"Ahrefs is working on general purpose search engine to compete with Google. Sounds crazy, right? But lets talk about two huge problems with Google which they will never want to fix:"
(tweet by Dmitry Gerasimenko, @botsbreeder, 27 March 2019)
What’s funny is that less than two weeks after Sarchy was spotted by Christophe Deschamps, Ahrefs [3] CEO Dmitry Gerasimenko tweeted that he wants to build a new search engine in collaboration with publishers and other online content makers ... [4] Although most SEOs who answered his thread are skeptical, with the growing success of DuckDuckGo and, in our French and German lands, Qwant, it could be the sign of something serious. The business model he proposes, at least, makes sense.
Emmanuel Barthe
French law librarian researcher, monitoring/CI specialist
search engine enthusiast (ex-Google de facto evangelist, ca. 1997, still a Google specialist for law research)
More info about YaCy and Sarchy’s implementation
- Agnel Vishal’s blog
- Crunchbase’s Condense.Press profile
- Condense.Press FB account
- a PDF presentation of YaCy by its lead developer: Web Search by the people, for the people, by Michael Christen, Rencontres Mondiales du Logiciel Libre (RMLL) 2011 (PDF, 26 pages)
- YaCy web site
- YaCy Wiki. Exists in French and in German
- YaCy video tutorials channel on YouTube
- some Twitter threads to check:
https://twitter.com/yacy_search/status/1106557416743657473
https://twitter.com/secou/status/1106499175078920198
https://mobile.twitter.com/precisement/status/1106725707068174341
https://mobile.twitter.com/agnelvishal/status/1106545049746042881
Notes
[1] According to Fabrica, INRIA’s blog, YaCy’s main weak point is that, being a distributed search engine, it "requires many users to achieve efficient indexing" (Yacy, the peer-to-peer search engine, Fabrica, 19 January 2016). Over the course of eight years, YaCy very slowly got installed on slightly more computers than the 600 it had in 2011. But not enough. Ideally, everyone should install YaCy on their own computer. But very few people are willing to make the effort.
[2] Free Software Activists to Take on Google With New Free Search Engine, by Jennifer Baker, PCWorld.com, 29 November 2011. YaCy: It’s About Freedom, Not Beating Google, by Katherine Noyes, PCWorld.com, 2 December 2011.
[3] Ahrefs is a well-known and reputable SEO tool publisher. Other major actors in this field include Moz Pro, SEMrush and Majestic.
[4] Ahrefs To Compete With Google Search & Share The Wealth With Publishers, by Barry Schwartz, SEO Roundtable, 28 March 2019.