Building a Corpus? Go the Chilka Way

Give the power of a vector database to your corpus

January 09, 2025


In NLP and machine learning tasks, as well as in research, extracting the right detail from voluminous text data is very important. You need a convenient, programmatic interface to your textual data.

In research settings, if you have large amounts of text and are short on time, you are forced to skim material that you wanted to study in more detail. Organizing the text so that it captures the essential metadata and provides convenient access saves time and keeps you from missing important passages.

The Chilka way of generating your corpus

You can call your organized text a corpus. Chilka is a corpus building framework with a pluggable document database backend. The chilka module specifies a simple interface via hooks. These hooks call the corresponding methods and classes implemented in the plugins. Each plugin implements the methods needed to interface with a specific database. You can find Chilka on GitHub.

Chilka uses document databases because they make it easy to store and retrieve textual data. Popular databases such as MongoDB and ChromaDB, which have differing characteristics, can each be interfaced through a corresponding plugin with minimal changes to the client program. To use a database of your own choice, you implement your own plugin.
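To make the hook and plugin relationship concrete, here is a minimal sketch in Python. It does not use Chilka's actual API: the names CorpusPlugin, InMemoryPlugin, Corpus, add_document and get_document are all hypothetical, and a plain dictionary stands in for a real database such as MongoDB or ChromaDB.

```python
from abc import ABC, abstractmethod


class CorpusPlugin(ABC):
    """Hypothetical plugin contract; Chilka's real hook names may differ."""

    @abstractmethod
    def add_document(self, doc_id: str, text: str, metadata: dict) -> None:
        """Store one document and its metadata in the backing database."""

    @abstractmethod
    def get_document(self, doc_id: str) -> dict:
        """Return the stored document for the given id."""


class InMemoryPlugin(CorpusPlugin):
    """Stand-in for a database plugin, backed by a plain dict."""

    def __init__(self) -> None:
        self._docs: dict[str, dict] = {}

    def add_document(self, doc_id: str, text: str, metadata: dict) -> None:
        self._docs[doc_id] = {"text": text, "metadata": dict(metadata)}

    def get_document(self, doc_id: str) -> dict:
        return self._docs[doc_id]


class Corpus:
    """Framework-side object: each method is a hook that forwards to the plugin."""

    def __init__(self, plugin: CorpusPlugin) -> None:
        self._plugin = plugin

    def add(self, doc_id: str, text: str, **metadata) -> None:
        self._plugin.add_document(doc_id, text, metadata)

    def get(self, doc_id: str) -> dict:
        return self._plugin.get_document(doc_id)


# The client only talks to Corpus; swapping the database means swapping the plugin.
corpus = Corpus(InMemoryPlugin())
corpus.add("doc-1", "A short example sentence.", source="example")
print(corpus.get("doc-1"))
```

In a design like this, supporting another database means writing one more subclass of the plugin contract, while the client code that calls add and get stays the same.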

Using Chilka starts with the client program. The client adapts to the kind of data that you want to work with and uses the Chilka module interface to build and manage the corpus. Note that the client needs a plugin to interface with its database. You can have as many client-plugin-database triads as you need. For instance, you might have one client and plugin to build a news story corpus around ChromaDB, and another client and plugin to build a recipe corpus.
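The division of labour between clients and plugins might look like the following sketch. Everything in it is an assumption made for illustration: DictStore is a hypothetical plugin stand-in that keeps records in memory, and build_news_corpus and build_recipe_corpus are hypothetical clients, not functions provided by Chilka.

```python
class DictStore:
    """Hypothetical plugin stand-in: keeps records in a dict, not a real database."""

    def __init__(self):
        self.records = {}

    def save(self, key, text, metadata):
        self.records[key] = (text, metadata)


def build_news_corpus(stories, plugin):
    """Client for news stories: knows the shape of a story, not the database."""
    for i, story in enumerate(stories):
        plugin.save(f"news-{i}", story["body"], {"headline": story["headline"]})


def build_recipe_corpus(recipes, plugin):
    """Client for recipes: joins the steps into one text per recipe."""
    for i, recipe in enumerate(recipes):
        text = "\n".join(recipe["steps"])
        plugin.save(f"recipe-{i}", text, {"title": recipe["title"]})


# Two independent client-plugin pairs, each with its own store.
news_store = DictStore()
recipe_store = DictStore()

build_news_corpus(
    [{"headline": "Local fair opens", "body": "The fair opened on Monday."}],
    news_store,
)
build_recipe_corpus(
    [{"title": "Tea", "steps": ["Boil water.", "Steep the leaves."]}],
    recipe_store,
)

print(news_store.records)
print(recipe_store.records)
```

Each client knows only the shape of its own data, so the two corpora can live in different databases without either client changing.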

Features

Schema-free: Chilka does not impose a specific schema on your corpus; each plugin is free to define its own. For instance, the example MongoDB plugin implements a schema with sentence-level granularity, while the tutorial example (Gutenberg jokes corpus) implements a ChromaDB plugin with chunk-level granularity.
Versatile interface: Although Chilka does impose a certain interface on the client, it is flexible enough to allow a plugin to implement specialized behaviour. For instance, the example ChromaDB plugin implements a semantic search feature, while the MongoDB plugin does not.
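The two features can be illustrated together with a small sketch. It makes several assumptions: SentenceStore and ChunkStore are hypothetical stand-ins for plugins rather than the example plugins themselves, the sentence splitter is deliberately naive, and search ranks chunks by simple word overlap, which is a toy substitute for the semantic search a vector database would provide.

```python
import re


class SentenceStore:
    """Hypothetical plugin storing one unit per sentence."""

    def __init__(self):
        self.units = []

    def add_text(self, text):
        # Naive split after '.', '!' or '?' followed by whitespace.
        parts = re.split(r"(?<=[.!?])\s+", text.strip())
        self.units.extend(p for p in parts if p)


class ChunkStore:
    """Hypothetical plugin storing fixed-size word chunks, with an extra search method."""

    def __init__(self, chunk_words=4):
        self.chunk_words = chunk_words
        self.units = []

    def add_text(self, text):
        words = text.split()
        for start in range(0, len(words), self.chunk_words):
            self.units.append(" ".join(words[start:start + self.chunk_words]))

    def search(self, query):
        # Toy word-overlap ranking, NOT a real embedding-based semantic search.
        if not self.units:
            return None
        wanted = set(query.lower().split())
        return max(self.units, key=lambda u: len(wanted & set(u.lower().split())))


text = "Why did the hen cross? To reach the other side. Nobody laughed."

sentence_store = SentenceStore()
chunk_store = ChunkStore(chunk_words=4)

for store in (sentence_store, chunk_store):
    store.add_text(text)
    print(type(store).__name__, store.units)
    # Specialized behaviour is optional, so the client checks before calling it.
    if hasattr(store, "search"):
        print("best match:", store.search("hen"))
```

Both stores accept the same add_text call, yet each decides its own granularity, and only one of them offers search. The client checks for the extra capability instead of assuming every plugin has it.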

What next?

Open up to a new world of managing your text with the Chilka corpus building framework.