Benchmarks for evaluating intent-based chatbots

Jordi Cabot
2 min read · Dec 30, 2022

Given the growing importance of bots in all aspects of our digital lives (including in software engineering!) and the well-known challenges of testing any kind of NLP-intensive bot, we could definitely use a series of de facto standard datasets to soundly evaluate and compare different NLP libraries and frameworks.

Note that, since we aim to check the quality of the intent-matching and named entity recognition (NER) components, we cannot just use raw NLP datasets. We need datasets where each sample includes (see the sketch after the list):

  • The user utterance
  • The intent that should be matched given that utterance
  • The list of entities that should be identified in that utterance
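As a rough illustration, a single sample in such a dataset could look like the following minimal sketch in Python. The utterance, intent name, entity types and field names here are all hypothetical, not taken from any specific dataset:

```python
# Hypothetical shape of one benchmark sample (field names are illustrative,
# not the schema of any particular dataset).
sample = {
    "utterance": "Book a table for two at Flora on Friday at 8pm",
    "intent": "BookRestaurant",
    "entities": [
        # Each entity: its type, surface value, and character offsets
        # in the utterance (start inclusive, end exclusive).
        {"type": "party_size", "value": "two", "start": 17, "end": 20},
        {"type": "restaurant_name", "value": "Flora", "start": 24, "end": 29},
        {"type": "time", "value": "Friday at 8pm", "start": 33, "end": 46},
    ],
}
```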

Even better if the dataset already comes with separate splits for training, validation and testing, so that different authors/vendors can precisely replicate and report the results they get when evaluating a given library.
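With such splits in place, scoring the intent-matching quality of a library reduces to comparing its predictions against the gold labels of the held-out test split. A minimal sketch, where `predict_intent` stands in for whatever inference call the library under evaluation exposes (a hypothetical placeholder, since every framework names this differently):

```python
from typing import Callable

def intent_accuracy(test_set: list[dict],
                    predict_intent: Callable[[str], str]) -> float:
    """Fraction of test utterances whose predicted intent matches the gold label."""
    hits = sum(
        1 for sample in test_set
        if predict_intent(sample["utterance"]) == sample["intent"]
    )
    return hits / len(test_set)

# Usage: train on the training split, tune on validation, and report numbers
# on the test split only, so results from different authors are comparable:
# accuracy = intent_accuracy(test_samples, my_bot.predict_intent)
```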

It turns out that, given the previous constraints, there is not much to choose from. We have collected all the NLP datasets for chatbots we know of in this GitHub repository: Awesome NLP benchmarks for intent-based chatbots (we already saw that GitHub is often used to host awesome lists and that some of them have in fact become top-starred GitHub projects 😉).

I hope you can suggest others to add to the list, especially datasets that cover not only the intent part but also entity recognition, as this is always a challenging yet critical aspect if you want to build a chatbot that offers a good user experience (instead of one that continuously asks you for every single piece of missing information).

