Jan-Philipp Litza

Document management, part 2: Nextcloud full text search

Nextcloud is my favorite piece of self-hosted software. It is so universal that it replaced almost everything else I used before. And I remember my surprise about the simplicity of setting up calendar and contact syncing via CalDAV a decade ago using Nextcloud, compared to earlier self-hosted solutions.

After my experiments with Paperless-ngx, I wanted to give one particular feature a go that I didn't use previously: Full text search.

The reason I didn't use that feature earlier is simple: Its official implementation requires you to set up an Elasticsearch server. I'm sure that's a great fit for enterprise usage of Nextcloud, and in my day job I operated several Elasticsearch clusters of various sizes. But for the family IT, that software seems very much overkill - both in terms of resource requirements and in the skill and time required to operate it.

What always amused me was that the Nextcloud engineers were considerate enough to make the whole framework very modular - provider apps provide the content to be indexed, platform apps implement the indexing and searching - and yet nobody ever implemented a platform app other than the official Elasticsearch one! There certainly is other software specialized in search (Solr comes to mind), and usually someone somewhere then implements an integration into software as popular as Nextcloud. But not in this case.

My idea, however, was much simpler. Nextcloud already has a database backend, so why not use it for searching? I knew MySQL had full text search capabilities, so I assumed PostgreSQL did as well, since in my impression it is usually the more advanced of the two. I didn't even know what other backends were supported, but surely supporting the most widely used ones would be most important!
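To illustrate what "using the database for searching" can look like, here is a minimal sketch of the native full text search syntax of both databases. It is not the code of the app described in this post: the function name, the `documents` table and the `content` column are made up for the example, and the `%s` placeholder assumes a driver that uses that parameter style.

```python
def fulltext_query(backend: str, table: str, column: str) -> str:
    """Return a parameterized full text search query for the given backend.

    Table and column names are hypothetical and must come from trusted
    code, never from user input, because they are interpolated directly.
    The search term itself is passed separately via the %s placeholder.
    """
    if backend == "mysql":
        # MySQL: MATCH ... AGAINST needs a FULLTEXT index on the column.
        return (
            f"SELECT id FROM {table} "
            f"WHERE MATCH({column}) AGAINST (%s IN NATURAL LANGUAGE MODE)"
        )
    if backend == "postgresql":
        # PostgreSQL: compare a tsvector of the text with a tsquery
        # built from the plain search term.
        return (
            f"SELECT id FROM {table} "
            f"WHERE to_tsvector('english', {column}) "
            f"@@ plainto_tsquery('english', %s)"
        )
    raise ValueError(f"unsupported backend: {backend}")


if __name__ == "__main__":
    for name in ("mysql", "postgresql"):
        print(fulltext_query(name, "documents", "content"))
```

The two queries look quite different, which hints at why supporting more than one backend takes extra work.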

So I dove into a jungle of code, split up between the Nextcloud server core (where - surprisingly - all of the interface definitions lived), the fulltextsearch app and the fulltextsearch_elasticsearch app. Along the way I found more and more documentation for Nextcloud developers, especially about how database access worked. But nothing really explained the whole mental model behind the full text search framework with its services, runners and interfaces. So I was left with only one option: Hack away and see where it leads!

Well, it led to the first ever Nextcloud app by yours truly: fulltextsearch_sql. It does exactly what I imagined, and it does so quite well actually!

Of course this is a mere proof of concept. The code is a spaghetti mess, and the number of TODOs scattered throughout the code and of issues on GitHub speaks for itself. I still haven't wrapped my head around the concepts of tags, subtags, metatags, parts and multiple excerpts. And yet… it works! And I probably will even keep using it on my personal instance.

I'm currently in the process of submitting it to the official app catalog. We'll see if and when that works out…

Update: After more than a week, my certificate request pull request was finally merged, enabling me to submit the app. So you can now find it in the Search category of installable apps on your Nextcloud instance! I also spent some time automating E2E tests and app releases with GitHub Actions and actually implementing PostgreSQL support (based on those tests, since my Nextcloud uses MySQL).

Frontend

Now that I had a working full text search backend, I was able to try and actually use it - after what felt like an eternity of indexing. But it turns out the front end for full text search in Nextcloud is surprisingly… rough? Every result is prefixed with "(files)" to indicate the content provider it belongs to. In a fully localized application, that feels out of place.

When the excerpt that the (my) platform app provides is too long for the available screen space, it's truncated in the middle, which is exactly where the searched term is most likely to show up. So maybe my understanding of what an excerpt should be differs from what Elasticsearch provides…
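To make the problem concrete, here is a small sketch of an excerpt that is already cut to a fixed width around the match, so that the search term survives any further shortening from the edges. This is only an illustration under assumptions, not the actual excerpt logic of the app or of Nextcloud; the function name and the default width are invented.

```python
def excerpt_around(text: str, term: str, width: int = 40) -> str:
    """Cut a window of at most `width` characters centered on `term`.

    Hypothetical helper for illustration: the match is case-insensitive
    and only the first occurrence of the term is considered.
    """
    pos = text.lower().find(term.lower())
    if pos == -1:
        # No match: fall back to the beginning of the text.
        return text[:width] + ("…" if len(text) > width else "")

    # Place the term roughly in the middle of the window.
    start = max(0, pos - (width - len(term)) // 2)
    end = min(len(text), start + width)
    # If the window hit the end of the text, extend it to the left.
    start = max(0, end - width)

    prefix = "…" if start > 0 else ""
    suffix = "…" if end < len(text) else ""
    return prefix + text[start:end] + suffix


if __name__ == "__main__":
    sample = "Invoice for the new washing machine, delivered in March"
    print(excerpt_around(sample, "washing", width=20))
```

With an excerpt built like this, the cut-off parts are at the edges rather than in the middle where the match sits.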

And that's only in the web interface. I opened a bug report for the Android app because, depending on a server setting, it opens search results in the browser instead of the app itself. Ironically, the setting in question is called "Open files directly from search results". Luckily, I can just turn it off, and everything works as it should.

Conclusion

All in all, the full text search experience with Nextcloud isn't great, but it seems to work without requiring additional software. Now I just need to export everything I put exclusively into Paperless-ngx in the last few months to my old folder structure in order to really compare the two solutions.