
Classifying textual data is one of the most important tasks in natural language processing (NLP) and has various applications, including sentiment analysis, spam detection, and topic modeling. However, working with textual data is challenging compared to numerical data due to the unstructured nature of text. Unlike numbers, text lacks a fixed format, making it harder to extract meaning and classify it.

With the rise of ChatGPT, it seems we have a new tool at hand that can interpret text. The underlying technology, called transformers, was first described in 2017 in the Google paper “Attention is all you need” [https://arxiv.org/abs/1706.03762]. It projects words into a semantic vector space – a concept that has existed for much longer – such that their relative positions correspond to their relative meanings. Take the following example: in a semantic space of dimension two, the words “king”, “queen”, “man” and “woman” would be positioned such that the direction from “man” to “king” roughly matches the direction from “woman” to “queen”.
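To see this vector arithmetic in action, here is a minimal sketch assuming the gensim library and its pretrained glove-wiki-gigaword-100 vectors (any pretrained word embeddings would do):

```python
import gensim.downloader as api

# Load pretrained GloVe word embeddings (downloaded on first use).
vectors = api.load("glove-wiki-gigaword-100")

# The classic analogy: king - man + woman ≈ queen.
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # expected to rank "queen" first
```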

There are three reasons why transformer-based language models have stunned the world and have AI legends preaching that we must design these systems ethically so that we don’t go extinct.

  • an attention mechanism learns how words in a text influence each other, which makes it possible to learn long-term dependencies
  • the architecture is parallelizable, such that vast amounts of data can be processed in reasonable time
  • new training techniques have made it possible to train these models on unlabeled data, making virtually all existing textual data candidate training data (scientific papers, websites, movie scripts…)

Where text used to be classified by structural features, such as the length of the text, the presence of specific words or phrases, and the frequency of specific characters or symbols, it can now be classified based on its meaning. In the next section, we discuss different methods of classifying textual data using open-source language models from Hugging Face.

Semantic text classification

The strength of language models lies in their ability to transfer knowledge from their huge training datasets to your use case. Let’s say you have some text to classify, but there’s no labeled data and you don’t even know what the possible classes are. By clustering the texts in semantic space, you can group related texts together. You can try this by asking ChatGPT in one query to generate 10 random animals, and in another to classify those animals into two classes. In essence, a semantic clustering is happening under the hood.
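To reproduce this outside a chat interface, here is a minimal sketch of semantic clustering, assuming the sentence-transformers library and scikit-learn (the model name all-MiniLM-L6-v2 is just an example):

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

animals = ["horse", "frog", "eagle", "salmon", "cow", "toad", "sparrow", "trout", "goat", "newt"]

# Embed each animal name into semantic space.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(animals)

# Cluster the embeddings into two groups; semantically similar animals should end up together.
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(embeddings)
for animal, label in zip(animals, labels):
    print(label, animal)
```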

When you do know the class labels, you can use distance metrics in semantic space to classify text. If you ask ChatGPT whether a horse is a mammal or an amphibian, it will correctly answer mammal, because it knows from training how horses, mammals and amphibians relate. This is called zero-shot text classification [https://huggingface.co/tasks/zero-shot-classification]. It also allows you to describe your classes and, for example, classify baker as “a job for which you have to wake up early” and bartender as “a job for which you can wake up late”. This works nicely for straightforward questions, but it will not perform well for more complex tasks, when domain knowledge is required, or when you use smaller language models.
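With the Hugging Face zero-shot-classification pipeline, this looks roughly like the sketch below (facebook/bart-large-mnli is a common default choice, but any NLI model works):

```python
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Classify against labels the model was never explicitly trained on.
result = classifier(
    "A horse gallops across the field.",
    candidate_labels=["mammal", "amphibian"],
)
print(result["labels"][0], result["scores"][0])  # highest-scoring label first
```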

You can leverage a language model’s deeper understanding to tackle more complex use cases by providing extra information in the text prompt itself. If you give the following prompt to ChatGPT: “If a horse is A and a frog is B, what is a zebra?”, it will correctly say A, because the model knows that horses and zebras are more similar than frogs and zebras, and horses are labeled A. This is called “in-context learning” and is very useful for structuring unstructured text based on one or a few examples, e.g. for creating SQL queries from questions. However, it makes inference a bit slower, because the more examples you give, the more text the language model has to process every time.
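The same trick works with open-source models: the examples go directly into the prompt and no weights are updated. A minimal sketch, assuming an instruction-tuned model such as google/flan-t5-base (chosen here purely for illustration):

```python
from transformers import pipeline

generator = pipeline("text2text-generation", model="google/flan-t5-base")

# The labeled examples live inside the prompt itself.
prompt = (
    "Classify the animal as A or B.\n"
    "horse -> A\n"
    "frog -> B\n"
    "zebra ->"
)
print(generator(prompt, max_new_tokens=5)[0]["generated_text"])
```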

Now let’s say you have more subjective classes, like good beers and bad beers. With only a couple of examples (about ten per class), language models can be fine-tuned so that the embeddings of the good beers move closer to each other and further from those of the bad beers. In combination with a simple classifier, you only need a few examples of good and bad beers to learn that the good ones are Belgian. This is called few-shot text classification [https://huggingface.co/blog/setfit]. It makes it possible to handcraft your dataset to tune your classifier.
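The SetFit library from the linked blog post implements this approach on top of sentence transformers. A minimal sketch, assuming a setfit version that exposes SetFitModel and SetFitTrainer (the API has evolved, so check the current docs), with made-up beer labels:

```python
from datasets import Dataset
from setfit import SetFitModel, SetFitTrainer

# A handful of labeled examples per class is enough to get started.
train_ds = Dataset.from_dict({
    "text": ["Westmalle Tripel", "Orval", "Generic light lager", "Stale supermarket pilsner"],
    "label": [1, 1, 0, 0],  # 1 = good beer, 0 = bad beer (illustrative labels)
})

model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")
trainer = SetFitTrainer(model=model, train_dataset=train_ds)
trainer.train()

print(model.predict(["Rochefort 10", "Watery canned beer"]))
```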

If there is sufficient labeled data (say 1000 instances per class), it is still worthwhile to properly fine-tune a language model [https://huggingface.co/learn/nlp-course/chapter3/3?fw=pt#fine-tuning-a-model-with-the-trainer-api] and train a classification head on top of it. Done well, this gives the best performance.
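With the Trainer API from the linked course chapter, such a fine-tuning run boils down to roughly the following sketch (the checkpoint and the imdb dataset are placeholders for your own model and data):

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "distilbert-base-uncased"  # assumption: any encoder checkpoint works
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Example dataset; replace with your own labeled texts.
dataset = load_dataset("imdb")
tokenized = dataset.map(lambda x: tokenizer(x["text"], truncation=True), batched=True)

args = TrainingArguments(output_dir="classifier", num_train_epochs=3)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    tokenizer=tokenizer,  # enables dynamic padding via the default data collator
)
trainer.train()
```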

Personal learnings

Recently, I’ve been working on a text classification feature for a client. I started with a zero-shot classification pipeline, but it quickly became clear that tuning class descriptions was a cumbersome, manual process and not a good framework to build upon. The results of the zero-shot classification, reviewed by a business user, turned out to be valuable examples, so I switched to a few-shot learning approach after that. In total, I spent triple the anticipated time on this feature. To save you from doing the same, here are some takeaways:

  1. Expectation management: language model web interfaces like ChatGPT and Hugging Face Chat are nice demonstrators of the power of language models. However, they are still machine learning models. They require careful integration in your application, well-defined tasks, and lots of testing. If you use them to trigger business users, be sure to temper expectations too. If you got triggered by them yourself, be sure to research the subject well and look into (state-of-the-art) alternatives. It sucks to work on something for months, only to realize an easy solution was already at hand.
  2. Hybrid solutions: during iterations with the business users, I noticed the subjectivity of our classes and how some misclassifications are more costly than others (think of diagnosing an ill person as healthy, or judging a fragile bridge to be stable). There is only so much you can do with supervised learning. Make sure to include whitelist and blacklist capabilities, so that you can incorporate remarks in future iterations (an account manager is not a manager in the finance department, but an employee in the sales department).
  3. Think ahead: you will spend the least time on a task like this by taking into account that you will spend some time on a task like this:
  • Do your research. You will find that a lot of your ideas have been tried before, and often you can even find code to start from. Also, if you use black-box components like language models, make sure you know what data they have been trained on, so you don’t waste time with an English model for a multilingual task.
  • Structure your code. It’s a data science project, so split your repo into a data, models and code section. Clearly version and describe your data. Keep model cards that describe each specific model as well as the training data used. Of course, track your code history with a tool like git.
  • Identify KPIs together with the business users that assess your model’s performance and are understandable to them. Discuss the use case. Involve the business users in a thought experiment: “If I could provide you with a model that classifies your text with an accuracy of, say, 90%, what would you do with it?” It might show you right off the bat that the model won’t be used because of business restrictions.
  • Anticipate and incorporate feedback. If you don’t already have labeled data, you will need it to evaluate your model. You can let business users review the results, but you would do well to incorporate those results into the next iterations (of course, make sure not to test on your training set).
  4. Data quality: first of all, it sucks to label data. As I said, you’ll need data at least to assess your model. Make sure you only have to label once. Choose classes wisely and with the right level of granularity. Check the literature for existing frameworks. Your text classification task is probably older than you think, and frameworks have been developed by people with plenty of domain knowledge – for job classification, think for instance of the European Standard Classification of Occupations (ESCO). Further, balance your examples to get a good view of per-class accuracy and to train on balanced data if you’re doing few-shot classification. If you can’t balance the data, use training methods designed for unbalanced data, like a weighted cost function (see the sketch after this list).
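A weighted cost function can be plugged into the Hugging Face Trainer by overriding its loss computation. A minimal sketch, assuming a binary classifier where class 0 is underrepresented (the weights are made up for illustration):

```python
import torch
from transformers import Trainer

class WeightedLossTrainer(Trainer):
    """Trainer that penalizes errors on the minority class more heavily."""

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        # Hypothetical weights: mistakes on class 0 count twice as much as on class 1.
        weights = torch.tensor([2.0, 1.0], device=outputs.logits.device)
        loss_fct = torch.nn.CrossEntropyLoss(weight=weights)
        loss = loss_fct(outputs.logits.view(-1, model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss
```

Drop this subclass in wherever you would otherwise use Trainer; everything else in the training setup stays the same.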