For my final year project at uni I decided to write an interactive question answering chatbot, which I lovingly named IQABOT. Though the project ultimately received a First, I think I could have planned better had I known a little more about what writing a question answering chatbot involved.
I’ll start by talking about my ideas and how insane they first were, then move on to the resources I used and the pitfalls to avoid. You can also find attached the Final Report that I submitted for the project. I’ll include some BibTeX in case anyone decides to reference the paper, but for the most part you should go through the bibliography and read the papers I used. So let’s begin.
Planning to Fail
We are all familiar with the saying “failing to plan is planning to fail”. Never did it ring so true as when I was planning my final year project at King’s College London under the excellent supervision of Dr. Jonathan Ginzburg.
As a natural-born geek (read: fan of Star Trek) I had always been a little excited about human-computer interaction using natural language. In fact I wrote my first programme on a child’s laptop (I was that child) that came with some flavour of BASIC. The programme would return messages depending on what the user typed and, all in all, was a bit pants.
Anyway, a few years and many Star Trek movies later I decided to undertake the Interactive Question Answering project as posted by Dr. Ginzburg. With all the excitement of a dog at Crufts (they like it, right? Otherwise why would they keep coming back?) I set about reading papers on question answering, chatbots and natural language processing in general. Having read enough papers to light a decent-sized bonfire, I began drawing up sketches for a complete interactive question answering programme.
After a swift kick in the head from reality I quickly realised that I had nowhere near enough time to do the whole system myself. Worse still, I’d already wasted almost 2 months.
Don’t get too excited and try to do it all. Plan your project around the most important aspect of your project’s title. In my case that was ‘interactive’.
With this in mind I recycled a truckload of papers and set about looking at what made my system interactive. That cut my workload down to something feasible, and luckily I had done enough research on the interactive element not to have completely wasted those first few months.
Natural language processing (NLP) has a lot of “hot topics” including sentiment analysis, speech recognition, automatic translation, (interactive) question answering, explanation generation and I’m sure a lot more.
And precisely because NLP has so much interesting research going on you will find a wealth of resources available to you. I will make mention of those I used but you can find alternatives in the included report and by searching the web of course.
The Natural Language Tool Kit
First up is the Natural Language Tool Kit (NLTK). NLTK is the Swiss Army knife of programming suites for those who wish to dive into natural language processing. The libraries are written in Python (more on that later) and everything is open source. As if this wasn’t enough, there is also an NLTK book which is available for free online, though buying a copy (or donating to the project) helps fund further efforts.
However, ne’er let it be said that my voice sings out praise alone. NLTK is not a commercial product and so lacks the rounded edges that might help novices feel safer when using it. This isn’t too much of a problem though and shouldn’t put off anyone who is eager to learn.
Another problem stems from the fact that some code hasn’t been included in the book or online. One problem I faced was trying to find out precisely what corpora and features had been used to train the default named-entity taggers. Posts in the NLTK users group suggest that this may be remedied when the contributors have the time.
The third and final issue with NLTK is solely to do with the book and the noticeable lack of answers for the exercises contained within. Posts on the user group suggest that “official” answers may never be provided. But chances are that if you join the user group and search previous posts you will find the answer you seek.
TrueKnowledge & START
Although I would be writing the code for the interactive parts of IQABOT, I needed existing programmes to provide the answers. My designs for IQABOT would allow for answers to be provided by multiple question-answer services. Originally I had intended to implement some redundancy so that when one service didn’t know the answer another could be queried without user intervention. I was again short on time, and the XML service for START was still in its experimental phase, so I only got to use TrueKnowledge. It was very good but still had some way to go before catching up with START’s vast knowledge.
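The redundancy idea is easy to sketch even though it never made it into the project. What follows is only an illustration of the general shape, not IQABOT's code: `ask_with_fallback`, `fake_trueknowledge` and `fake_start` are made-up names, and the two "services" are plain dictionaries standing in for real web services (they do not call TrueKnowledge or START). It is written for a current Python 3 rather than the 2.6 used at the time.

```python
def ask_with_fallback(question, services):
    """Try each service in order and return (name, answer) from the first
    one that produces an answer, or (None, None) if none of them can."""
    for name, ask in services:
        try:
            answer = ask(question)
        except Exception:
            # Assumption: a service that errors out is treated the same
            # as one that simply doesn't know the answer.
            continue
        if answer:
            return name, answer
    return None, None


# Hypothetical stand-ins for real question-answer services.
def fake_trueknowledge(question):
    known = {"who wrote hamlet?": "William Shakespeare"}
    return known.get(question.lower())


def fake_start(question):
    known = {"what is the capital of france?": "Paris"}
    return known.get(question.lower())


# Order matters: services earlier in the list are preferred.
SERVICES = [("trueknowledge", fake_trueknowledge), ("start", fake_start)]

if __name__ == "__main__":
    print(ask_with_fallback("What is the capital of France?", SERVICES))
```

Keeping the services as a simple ordered list of (name, function) pairs means adding a third service is one extra line, and the user never needs to know which one answered.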
Python has ruined other programming languages for me. You cannot know the joy of programming in Python until you have tried it yourself.
Choosing the right language for your project is incredibly important. I whittled my choices down to Python and Java, and after reading about Python’s strength with regard to text processing my choice was made. There is an example somewhere on the net that demonstrates some text processing functionality written in Python and Java. The Python code took up 1 line whereas the equivalent Java code was somewhere in the region of 5 or 6 lines. Not a huge difference, but as your code begins to grow to thousands of lines you will be glad to be writing in Python.
I used Python 2.6 for the project as NLTK had not yet (and at the time of writing still hasn’t) migrated to Python 3.0. NLTK’s authors have said that at some point in the future NLTK will be migrated, though, so we should be able to take advantage of some of version 3.0’s niceties.
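I can't point to that original comparison, so here is a stand-in of my own choosing in the same spirit: counting the most common words in a piece of text. The sample sentence is invented, and note that `collections.Counter` only arrived in Python 2.7, so this particular one-liner is written for a current Python rather than the 2.6 mentioned below.

```python
from collections import Counter

# Invented sample text, purely for illustration.
text = "the cat sat on the mat and the dog sat too"

# The one line that does the work: split on whitespace, count each word,
# and keep the two most frequent.
top_words = Counter(text.split()).most_common(2)

print(top_words)  # [('the', 3), ('sat', 2)]
```

The Java equivalent needs a map, a loop over the words and a sort before you get the same result, which is roughly where the extra lines come from.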
Some of you may be familiar with AIML (the Artificial Intelligence Markup Language). For those who aren’t, it’s an XML-compliant markup language specifically designed for use with A.L.I.C.E., a very advanced chatterbot that won several prizes. Unfortunately I wasn’t able to get the Python implementation of AIML (PyAIML) to work before the project deadline so didn’t have a chance to play with it, but Dr. Suresh Manandhar published results that suggested his system, YourQA, made good use of AIML.
That’s all I have for now. I’m attaching (quite what that means I’m not entirely sure) my report for the IQABOT project to this post in the hope that it might help someone out there who is getting started with interactive question answering. It is a fun and challenging area and I hope you have as much fun working on a programme as I did.
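To give a flavour of what AIML looks like, here is a rough sketch that reads two AIML-style categories with Python's standard XML parser and looks up a reply. It is not PyAIML and not how A.L.I.C.E. works internally: the categories, the reply text and the helper names (`load_categories`, `normalise`, `respond`) are all invented for illustration, and the matching is exact only, ignoring the wildcards that real AIML patterns support.

```python
import xml.etree.ElementTree as ET

# Two made-up categories: each pairs a pattern with a template reply.
AIML_SNIPPET = """
<aiml version="1.0.1">
  <category>
    <pattern>HELLO</pattern>
    <template>Hi there!</template>
  </category>
  <category>
    <pattern>WHAT IS YOUR NAME</pattern>
    <template>My name is IQABOT.</template>
  </category>
</aiml>
"""


def load_categories(aiml_text):
    """Return a dict mapping each pattern to its template text."""
    root = ET.fromstring(aiml_text)
    return {
        category.findtext("pattern").strip(): category.findtext("template").strip()
        for category in root.iter("category")
    }


def normalise(text):
    """Upper-case the input and drop punctuation so it can be compared
    with the upper-case patterns above."""
    kept = "".join(ch for ch in text if ch.isalnum() or ch.isspace())
    return " ".join(kept.upper().split())


def respond(text, categories, default="I don't know how to answer that."):
    """Exact-match lookup only; no wildcard handling in this sketch."""
    return categories.get(normalise(text), default)


CATEGORIES = load_categories(AIML_SNIPPET)

if __name__ == "__main__":
    print(respond("Hello!", CATEGORIES))
```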
Here is the report