Tutorial: Benchmark Your Chatbot on Watson, Dialogflow, Wit.ai and more

  • use multiple chatbot technologies
  • set up test automation in a few minutes
  • enjoy a new improved user interface
  • get the benefits of a hosted, free service


  • Split the data into two parts. The first part is used to train the artificial intelligence, and the second part is used to test it.
  • To reduce flakiness, repeat this several times and average the results.
  1. For each intent, remove some of the user examples and train a new NLU model
  2. Evaluate the removed user examples and compare the predicted intent to the expected intent
  3. Calculate precision, recall and F1 and average over all intents
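
The three steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not Botium's implementation: the function names `holdout_split` and `macro_scores` are made up for this example, and the predicted intents would come from whatever NLU engine you trained on the remaining examples.

```python
import random
from collections import defaultdict


def holdout_split(examples_by_intent, test_fraction=0.2, seed=0):
    """Per intent, move a fraction of the user examples into a test set."""
    rng = random.Random(seed)
    train, test = {}, {}
    for intent, examples in examples_by_intent.items():
        shuffled = list(examples)
        rng.shuffle(shuffled)
        # Hold out at least one example per intent.
        n_test = max(1, int(len(shuffled) * test_fraction))
        test[intent] = shuffled[:n_test]
        train[intent] = shuffled[n_test:]
    return train, test


def macro_scores(expected, predicted):
    """Precision, recall and F1 per expected intent, averaged over those intents."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for exp, pred in zip(expected, predicted):
        if exp == pred:
            tp[exp] += 1
        else:
            fn[exp] += 1   # the expected intent was missed
            fp[pred] += 1  # the predicted intent was wrongly chosen
    precisions, recalls, f1s = [], [], []
    for intent in sorted(set(expected)):
        p_den = tp[intent] + fp[intent]
        r_den = tp[intent] + fn[intent]
        p = tp[intent] / p_den if p_den else 0.0
        r = tp[intent] / r_den if r_den else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)
    n = len(precisions)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


if __name__ == "__main__":
    expected = ["weather", "weather", "greet", "greet"]
    predicted = ["weather", "greet", "greet", "greet"]
    print(macro_scores(expected, predicted))
```

Averaging per intent (a macro average) gives every intent the same weight, no matter how many user examples it has. Whether that is the right choice depends on how unevenly your examples are spread across intents.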

K-Fold Cross-Validation

Figure: k-fold cross-validation (image from Wikipedia)
  1. For each intent, split the user examples into k pieces
  2. Use every piece except one for training a new NLU model
  3. Evaluate the remaining piece: compare the predicted intent to the expected intent, then calculate precision, recall and F1 averaged over all intents
  4. Repeat steps 2 and 3 for each of the other pieces
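
As a rough sketch of the splitting part of this procedure, the following Python generator deals the user examples of each intent round-robin into k pieces and then yields one train/test pair per round. The name `k_fold_rounds` is hypothetical, and the way Botium actually assigns examples to pieces may differ (for example, it may shuffle first).

```python
def k_fold_rounds(examples_by_intent, k):
    """Yield (train, test) lists of (utterance, intent) pairs, one per round."""
    folds = [[] for _ in range(k)]
    for intent, examples in examples_by_intent.items():
        # Deal the examples of each intent evenly across the k pieces.
        for i, utterance in enumerate(examples):
            folds[i % k].append((utterance, intent))
    for held_out in range(k):
        test = folds[held_out]
        train = [
            pair
            for j, fold in enumerate(folds)
            if j != held_out
            for pair in fold
        ]
        yield train, test


if __name__ == "__main__":
    data = {
        "weather": ["w0", "w1", "w2", "w3", "w4"],
        "greet": ["g0", "g1", "g2", "g3", "g4"],
    }
    for round_no, (train, test) in enumerate(k_fold_rounds(data, k=5), start=1):
        print(round_no, len(train), len(test))
```

Each example ends up in the test set exactly once, so every round evaluates the model on data it has not seen during training.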

Benchmarking With Botium

Prepare Botium Configuration

{
  "botium": {
    "Capabilities": {
      "PROJECTNAME": "Botium Project wit.ai",
      "WITAI_TOKEN": "...",
      "WITAI_LANG": "en"
    }
  }
}

Prepare Your Data

> botium-cli nlpextract --config /pathto/watson.botium.json --convos /pathto/outputdirectory --verbose
Is it me or it is hot here?
What's the current temperature?
What’s the current temperature?

Benchmark Your Data

> botium-cli nlpanalytics k-fold -k 5 --config /pathto/watson.botium.json --convos /pathto/utterances/
############# Summary #############
K-Fold Round 1: Precision=0.7653 Recall=0.7708 F1-Score=0.7680
K-Fold Round 2: Precision=0.6958 Recall=0.6875 F1-Score=0.6916
K-Fold Round 3: Precision=0.7460 Recall=0.6806 F1-Score=0.7118
K-Fold Round 4: Precision=0.7361 Recall=0.7361 F1-Score=0.7361
K-Fold Round 5: Precision=0.6931 Recall=0.7083 F1-Score=0.7006
K-Fold Avg: Precision=0.7273 Recall=0.7167 F1-Score=0.7219
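
To see how the last summary line relates to the five rounds, the arithmetic can be redone by hand. The short Python check below uses only the rounded numbers printed above, so it is an approximation of what the tool computes internally, and the variable names are chosen for this example.

```python
# Per-round (precision, recall, F1) values copied from the summary above.
rounds = [
    (0.7653, 0.7708, 0.7680),
    (0.6958, 0.6875, 0.6916),
    (0.7460, 0.6806, 0.7118),
    (0.7361, 0.7361, 0.7361),
    (0.6931, 0.7083, 0.7006),
]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


avg_precision = mean(p for p, _, _ in rounds)
avg_recall = mean(r for _, r, _ in rounds)

# Two possible readings of an "average F1":
f1_of_averages = 2 * avg_precision * avg_recall / (avg_precision + avg_recall)
mean_of_f1 = mean(f for _, _, f in rounds)

print(f"Precision={avg_precision:.4f} Recall={avg_recall:.4f}")
print(f"F1 from averaged precision/recall={f1_of_averages:.4f}")
print(f"Mean of per-round F1={mean_of_f1:.4f}")
```

With these inputs, the averaged precision and recall reproduce the reported 0.7273 and 0.7167. The reported F1-Score of 0.7219 matches the harmonic mean of those two averages, while the plain mean of the five per-round F1 values comes out slightly lower at 0.7216.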

Looking for contributors

Co-Founder and CTO Botium🤓 — Guitarist 🎸 — 3xFather 🐣

Florian Treml
