Getlabeltext.com is a cloud text classifier. Text classifiers are nothing more than statistical functions that accept text as input and produce a label as output. To do that, text classifiers rely on an internal “database” of text tokens and their frequencies. This “database” is built during the training phase and can be reused for every run thereafter.
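
To make the “database” idea concrete, here is a minimal sketch of a token-frequency classifier. It assumes naive Bayes style scoring with add-one smoothing, which is one common way to turn token frequencies into a label and not necessarily what getlabeltext.com uses. The class, method and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Deliberately naive: lowercase and split on whitespace.
    return text.lower().split()


class TokenFrequencyClassifier:
    """Toy classifier: the "database" is just per-label token counts."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # label -> token -> count
        self.label_counts = Counter()             # label -> number of training texts
        self.vocabulary = set()                   # every token seen during training

    def train(self, text, label):
        # Training only updates the counts, so they can be stored and reused.
        self.label_counts[label] += 1
        for token in tokenize(text):
            self.token_counts[label][token] += 1
            self.vocabulary.add(token)

    def classify(self, text):
        total_texts = sum(self.label_counts.values())
        tokens = tokenize(text)
        best_label, best_score = None, float("-inf")
        for label, text_count in self.label_counts.items():
            counts = self.token_counts[label]
            total_tokens = sum(counts.values())
            # Log prior plus log likelihood of each token (add-one smoothing).
            score = math.log(text_count / total_texts)
            for token in tokens:
                score += math.log(
                    (counts[token] + 1) / (total_tokens + len(self.vocabulary))
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


classifier = TokenFrequencyClassifier()
classifier.train("great product love it", "positive")
classifier.train("terrible broken waste", "negative")
print(classifier.classify("love this great thing"))
```

Training only ever touches the counters, which is why the result of a long training run can be saved once and shared by every classifier instance afterwards.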

Training classifiers is time-consuming and usually requires large amounts of data, which means getlabeltext.com cannot be built as one big monolithic app. It needs to be broken down into smaller, decoupled, deployable pieces with varying cardinality, e.g. the app should be able to run classifiers in parallel on multiple individual nodes.

There’s nothing groundbreaking about this idea. Microservices have been around as a concept for about a decade (maybe more), but the architecture is still not obvious. The classifier module/service depends on data to do its work, but if it’s also responsible for managing that data it will quickly turn into one big monolith in its own right. Users need to be able to create datasets, view them, alter them, delete them and so on. Add these tasks to those of running and training classifiers and it’s clear this is too much responsibility for a single microservice. Also, the classifier module is probably the one that will be running in multiple instances, so data management must be provided by a different service.

How does such a service work, though? Datasets (especially training data) will probably run in the gigabyte range, way larger than what a REST app can normally handle. My first instinct is to have a data microservice that provides all the CRUD management services to users, but also provides an API that exports data to a CSV file directly from Postgres, which will then be served by nginx.

Load dataset sequence

Exporting data directly from Postgres should be the fastest option, and the CSV format is perfectly usable from Python and Pandas. For example:

COPY (SELECT * FROM training_data WHERE project_id = ?) TO '/var/www/data/ac1d32ce-e999-436a-8c8d-18e7d0850575.csv' WITH (FORMAT csv, HEADER)
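
As a rough sketch of how the data service might assemble that statement, the helper below builds the SQL and a fresh file name for one project. Several things here are assumptions: the function name `build_export` is hypothetical, `project_id` is assumed to be an integer, and the file name is assumed to be a random UUID like the one in the example. It adds an explicit `WITH (FORMAT csv, HEADER)` clause, since COPY writes tab-delimited text by default. The id is inlined after validation because, as far as I know, COPY is a utility statement and does not accept server-side bind parameters.

```python
import uuid
from pathlib import PurePosixPath

# Directory taken from the example above; it must be writable by Postgres
# and readable by nginx.
EXPORT_DIR = PurePosixPath("/var/www/data")


def build_export(project_id, export_dir=EXPORT_DIR):
    """Return (sql, path) for exporting one project's training data."""
    # int() raises ValueError for anything that is not a plain integer,
    # which keeps arbitrary text out of the inlined SQL.
    project_id = int(project_id)
    # Hypothetical naming scheme: one random UUID per export file.
    path = export_dir / f"{uuid.uuid4()}.csv"
    sql = (
        f"COPY (SELECT * FROM training_data WHERE project_id = {project_id}) "
        f"TO '{path}' WITH (FORMAT csv, HEADER)"
    )
    return sql, path


sql, path = build_export(42)
print(sql)
```

The file is written by the Postgres server process on the database host, so this only works if that host shares the export directory with nginx.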

To make the file accessible to other services, serving it through nginx with the sendfile on; directive should be sufficient.
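
On the consuming side, a classifier node could stream the file to local disk and then read it row by row. This is only a sketch using the Python standard library: the function names are hypothetical, the URL in the usage note is a placeholder, and the CSV is assumed to start with a header row.

```python
import csv
import shutil
import urllib.request


def download_export(url, destination, chunk_size=1024 * 1024):
    """Stream an exported CSV to local disk without loading it into memory."""
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        # Copy in fixed-size chunks so gigabyte-sized files stay off the heap.
        shutil.copyfileobj(response, out, chunk_size)
    return destination


def iter_rows(path):
    """Yield one dict per CSV row; assumes the first row holds column names."""
    with open(path, newline="", encoding="utf-8") as handle:
        yield from csv.DictReader(handle)
```

A node would call something like `download_export("http://data.example.internal/<file>.csv", "/tmp/training.csv")` and then loop over `iter_rows("/tmp/training.csv")`. With Pandas, `pandas.read_csv(path, chunksize=...)` gives a similar chunk-at-a-time iteration.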

Will this work? Follow this space for updates ;)