Friday, July 21, 2017

Intents, Slots and SHRDLU

Several things came together lately. I have been reading the chatbot blogs that are flourishing these days. For chatbots, the concept of the user's intent is important: the intent of a sentence is what the user means by it. But what struck me, even if it wasn't said in so many words, is that the number of intents is restricted. Remember my examination of Siri's commands? Those are typical intents, and their number is limited.



At first I was bored by this, because I want my AI to have unlimited possibilities. But then I saw its strength.

As it happened, I was looking for the best way to design an NLI (natural language interface), and for a definition of when an NLI is "complete". When is the design finished? This is an important question when you are building an NLI.

And I read Winograd and Flores' 1987 book "Understanding Computers and Cognition", a book that still needs to sink in, and which at first was actually rather demotivating because it points out the limits of AI. But I already took one lesson from it: computers are tools. They don't really understand anything. Everything needs to be told. You can't expect a computer to "get it" and continue on its own.

Together, these things gave me the idea of a finite set of intents: semantic structures that act like the services of an API. And the idea that these intents are where the design of an NLI starts.

Designing from Intents

An NLI has these three layers:
  • Syntax
  • Semantics
  • Database language
When you start to design an NLI, you could begin by defining which aspects of syntax you would like to support. But this quickly leads to over-engineering, because syntax, while complicated, is the easy part: every syntactic structure you support becomes input that the other layers must be able to handle. What's more, syntax is never "complete". More syntax is always better.

But when you take semantics as a starting point, or more precisely, a fixed set of intents, this changes. Taking this route, you ask yourself: what are the things the user actually wants from the database? And this turns out to be surprisingly limited.

Once the intents have been defined, syntax follows. Only syntactic structures that will actually be used need to be implemented. Also, a single intent can have multiple syntactic representations.

The Intents of Alexa

There are several commercial agents you can talk to these days that allow you to extend their standard functionality with custom code.

Amazon's Alexa (featured on devices like the Amazon Echo) allows you to create skills. A skill is a conversation subject, like the weather, movies, planning a trip, or whatever.

A skill consists of intents and utterances. An intent is a "lambda function" with a name and arguments. The arguments are called slots. Example intents and slots for the weather skill: WEATHER(date, location), RAIN(date, location), BARBEQUE(date). The functions can be written in several different languages, like JavaScript and Python.
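To make the intent/slot idea concrete, here is a minimal sketch of the weather skill above in which each intent is a named function whose parameters are its slots. This is not the Alexa Skills Kit API; the `Intent` class, the `invoke` helper and the handler bodies are hypothetical and only illustrate the shape of the idea.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Intent:
    """Hypothetical model of an intent: a name, its slots, and a handler."""
    name: str
    slots: List[str]
    handler: Callable[..., str]


# Placeholder handlers; a real skill would look up an actual forecast.
def weather(date: str, location: str) -> str:
    return f"Looking up the weather in {location} for {date}"


def rain(date: str, location: str) -> str:
    return f"Looking up rain in {location} for {date}"


def barbeque(date: str) -> str:
    return f"Looking up barbeque weather for {date}"


WEATHER_SKILL: Dict[str, Intent] = {
    "WEATHER": Intent("WEATHER", ["date", "location"], weather),
    "RAIN": Intent("RAIN", ["date", "location"], rain),
    "BARBEQUE": Intent("BARBEQUE", ["date"], barbeque),
}


def invoke(skill: Dict[str, Intent], intent_name: str, slot_values: Dict[str, str]) -> str:
    """Call the intent's handler with its filled slots; fail if a slot is missing."""
    intent = skill[intent_name]
    missing = [s for s in intent.slots if s not in slot_values]
    if missing:
        raise ValueError(f"missing slots for {intent_name}: {missing}")
    return intent.handler(**{s: slot_values[s] for s in intent.slots})
```

For example, `invoke(WEATHER_SKILL, "RAIN", {"date": "tomorrow", "location": "Amsterdam"})` returns `"Looking up rain in Amsterdam for tomorrow"`.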

While intents are defined on the semantic level, the syntactic level is defined by utterances. Each intent can have multiple utterances. Here are some utterances for RAIN(date, location): "is it going to rain {date}", "will it rain {date} in {location}", "do I need an umbrella?". Notice the slots. Slots have slot types; the slots named here have the date and location types.
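The utterances show how several phrasings lead to one intent. Here is a toy sketch that turns each `{slot}` template into a regular expression and returns the intent plus the slot values. The template table and the `match` function are hypothetical; as far as I know, Alexa uses the sample utterances to train a language model rather than matching them literally.

```python
import re
from typing import Dict, List, Optional, Tuple

# Sample utterances from the text, grouped by intent.
UTTERANCES: Dict[str, List[str]] = {
    "RAIN": [
        "is it going to rain {date}",
        "will it rain {date} in {location}",
        "do I need an umbrella?",
    ],
}


def template_to_regex(template: str):
    """Turn 'will it rain {date} in {location}' into a regex with named groups."""
    # re.split with a capturing group alternates literal text and slot names.
    parts = re.split(r"\{(\w+)\}", template)
    pattern = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            pattern += re.escape(part)        # literal words
        else:
            pattern += f"(?P<{part}>.+?)"      # slot value
    return re.compile("^" + pattern + "$", re.IGNORECASE)


def match(utterance: str) -> Optional[Tuple[str, Dict[str, str]]]:
    """Return (intent, slot values) for the first matching template, else None."""
    text = utterance.strip()
    for intent, templates in UTTERANCES.items():
        for template in templates:
            m = template_to_regex(template).match(text)
            if m:
                return intent, m.groupdict()
    return None
```

Note that "do I need an umbrella?" yields RAIN with no slots filled; a real system would then have to ask for, or assume, a date and location.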



If you want to read more about it, Amazon has detailed descriptions in its developer documentation. It is a lot of work! For a fun and friendly introduction, read Liz Rice's blog posts.

The Intents of SHRDLU

I went on to test the idea of starting with intents. And what better place to start than the most complex NLI ever built: SHRDLU. SHRDLU, built by Terry Winograd in 1969, exhibits many NLI characteristics that have rarely been seen in later systems:
  • Questions about the domain model of the system: "Can a pyramid support a pyramid?"
  • Questions about the system's history and its decisions: "Why did you clear off that cube?"
  • Introduction of new concepts: "A 'steeple' is a stack which contains two green cubes and a pyramid"
But back to intents. I took the example dialog from Winograd's book "Understanding Natural Language" and tried to sketch what the intents of the sentences would be.

1. pick up a big red block
   KB:PICKUP!(Object) 
2. grasp the pyramid
   KB:PICKUP!(Object)
3. find a block which is taller than the one you are holding and put it into the box
   KB:FIND?(Object) && KB:PUTIN!(Object, Object)
4. what does the box contain?
   KB:WHAT?(ObjectB) && KB:CONTAIN(ObjectA, ObjectB)
5. what is the pyramid supported by?
   KB:WHAT?(ObjectA) && KB:SUPPORT(ObjectA, ObjectB)
6. how many blocks are not in the box?
   KB:HOW_MANY?(ObjectB) && !KB:CONTAIN(ObjectA, ObjectB)
7. is at least one of them narrower than the one which I told you to pick up?
   KB:PRESENT?(ObjectA) && KB:NARROWER(ObjectA, ObjectB)
8. is it supported?
   KB:SUPPORT?(Object)
9. can the table pick up blocks?
   DM:CAN?(P) && P=KB:PICKUP(Object)
10. can a pyramid be supported by a block?
    DM:CAN?(P) && P=KB:SUPPORT(Object, Object)
11. can a pyramid support a pyramid?
    DM:CAN?(P) && P=KB:SUPPORT(Object, Object)
12. stack up two pyramids
    KB:STACKUP!(Object, Object)
13. the blue pyramid is mine
    UM:OWN(Person, Object)
14. I own blocks which are not red, but I don't own anything which supports a pyramid
    UM:OWN(Person, Object) && UM:OWN(Person, Object)
15. do I own the box?
    UM:OWN?(Person, Object)
16. do I own anything in the box?
    UM:OWN?(Person, Object)
17. will you please stack up both of the red blocks and either a green cube or a pyramid?
    KB:STACKUP!(Object, Object)
18. which cube is sitting on the table?
    KB:WHICH?(ObjectB) && KB:SUPPORT(ObjectA, ObjectB)
19. is there a large block behind a pyramid?
    KB:PRESENT?(Object) && KB:BEHIND(Object, Location)
20. put a small one onto the green cube which supports a pyramid
    KB:PUTON!(Object, Object)
21. put the littlest pyramid on top of it
    KB:PUTON!(Object, Object)
22. how many things are on top of green cubes?
    KB:HOW_MANY?(ObjectB) && KB:SUPPORT(ObjectA, ObjectB)
23. had you touched any pyramid before you put the green one on the little cube?
    TM:TOUCH?(Person, Object) && TM:AT(Time)
24. when did you pick it up?
    TM:WHEN?(E) && E=KB:PICKUP(Object)
25. why?
    TM:WHY?(E)
26. why did you do that?
    TM:WHY?(E) && E=TM:DO()
27. why did you clear off that cube?
    TM:WHY?(E) && E=KB:CLEAR_OFF(Object)
28. why did you do that?
    TM:WHY?(E) && E=TM:DO()
29. why did you do that?
    TM:WHY?(E) && E=TM:DO()
30. how did you do it?
    TM:HOW?(E) && E=TM:DO()
31. how many objects did you touch while you were doing it?
    KB:HOW_MANY?(Object) && TM:TOUCH(Object) && TM:AT(Time)
32. what did the red cube support before you started to clean it off?
    KB:WHAT?(Object) && KB:SUPPORT(Object, Object) && TM:AT(Time)
33. there were five blocks to the left of the box then.
    KB:PRESENT(Object, Location) && TM:AT(Time)
34. put the blue pyramid on the block in the box
    KB:PUTON!(Object, Object)
35. is there anything which is bigger than every pyramid but is not as wide as the thing that supports it?
    KB:PRESENT?(ObjectA) && KB:BIGGER(ObjectA, ObjectB) && !KB:WIDE(ObjectA, ObjectC) && KB:SUPPORT(ObjectA, ObjectC)
36. does a steeple
    ---
37. a "steeple" is a stack which contains two green cubes and a pyramid
    DM:DEFINE!(Word, Object)
38. are there any steeples now?
    KB:PRESENT?(Object, Time)
39. build one
    KB:BUILD!(Object)
40. call the biggest block "superblock"
    DM:DEFINE!(Word, Object)
41. have you picked up superblock since we began?
    TM:PICKEDUP?(Object) && TM:AT(Time)
42. why did you drop it?
    TM:WHY?(E) && E=KB:DROP(Object)
43. is there anything to the right of the red pyramid?
    KB:PRESENT?(Object, Location)
44. thank you
    SM:THANK_YOU()
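To see how such annotations could be executed, here is a minimal sketch that evaluates the annotations of sentences 4 and 6 against a tiny scene. The object names and facts are invented, not SHRDLU's; the point is that both questions share the CONTAIN predicate and differ only in the question intent (WHAT? versus HOW_MANY?) and the negation.

```python
from typing import Dict, List, Set, Tuple

# Invented toy scene: object id -> type.
OBJECTS: Dict[str, str] = {
    "b1": "block",
    "b2": "block",
    "b3": "block",
    "p1": "pyramid",
    "box": "box",
}

# CONTAIN(ObjectA, ObjectB) facts, stored as (container, contained) pairs.
CONTAIN: Set[Tuple[str, str]] = {("box", "b2"), ("box", "p1")}


def contain(a: str, b: str) -> bool:
    return (a, b) in CONTAIN


def what_contains(container: str) -> List[str]:
    """4. what does the box contain?
    KB:WHAT?(ObjectB) && KB:CONTAIN(ObjectA, ObjectB)"""
    return sorted(b for b in OBJECTS if contain(container, b))


def how_many_not_in(kind: str, container: str) -> int:
    """6. how many blocks are not in the box?
    KB:HOW_MANY?(ObjectB) && !KB:CONTAIN(ObjectA, ObjectB)"""
    return sum(1 for obj, t in OBJECTS.items() if t == kind and not contain(container, obj))
```

In this scene, `what_contains("box")` gives `["b2", "p1"]` and `how_many_not_in("block", "box")` gives `2`.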

Here's a movie of such a SHRDLU interaction.



Many of the intents here are similar. I will now collect them and add comments. But first I must describe the prefixes I used for the modules involved:
  • KB: knowledge base, the primary database
  • DM: domain model, contains meta knowledge about the domain
  • UM: a model specific to the current user
  • TM: task manager
  • SM: smalltalk
The number of intents of a system should be as small as possible, to avoid code duplication, but no smaller than that, to avoid conditional structures inside the intents. I think an intent should be not only about the lambda function, but also about the type of output the user expects.

These then are the aspects of an NLI intent:

  • a precondition: while a simple invocation name is sufficient to wake up a skill, this will not do for a more complex NLI; a semantic condition is necessary.
  • a procedure: while a hardcoded function (the lambda function) is sufficient for a chatbot, an NLI should use parameterized procedures in a language like Datalog. Each step of the procedure must be manageable by a task manager.
  • an output pattern: different types of questions require different types of answers. It is possible that two intents only differ in the responses they give.
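The three aspects can be sketched as a small data structure. In this hypothetical sketch the semantic structure is just a dict, plain Python functions stand in for the Datalog-style procedures, and two intents share the same procedure but differ only in their output pattern. The `NliIntent` class and the `mood` key are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Sem = Dict[str, Any]  # stand-in for a real semantic structure


@dataclass
class NliIntent:
    name: str
    precondition: Callable[[Sem], bool]    # semantic condition that selects the intent
    procedure: Callable[[Sem], List[str]]  # stands in for a parameterized (e.g. Datalog) procedure
    output: Callable[[List[str]], str]     # output pattern: turns the result into an answer


BLOCKS = ["b1", "b2", "b3"]  # invented toy data


def find_blocks(sem: Sem) -> List[str]:
    return [b for b in BLOCKS if b not in sem.get("exclude", [])]


# Same procedure, different output patterns.
INTENTS = [
    NliIntent("PRESENT?", lambda s: s["mood"] == "yes_no", find_blocks,
              lambda r: "Yes." if r else "No."),
    NliIntent("HOW_MANY?", lambda s: s["mood"] == "count", find_blocks,
              lambda r: str(len(r))),
]


def respond(sem: Sem) -> str:
    """Run the first intent whose precondition holds for the semantic structure."""
    for intent in INTENTS:
        if intent.precondition(sem):
            return intent.output(intent.procedure(sem))
    return "I don't understand."
```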

Verbs can be the syntactic representation of intents, but "why" can be an intent as well. I use the suffix ! for commands and ? for questions.
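The annotation notation used above (module prefix, intent name, ! or ? suffix, optional negation and variable binding) is regular enough to parse mechanically. The following is a hypothetical parser; the regular expression, the `Call` fields and the "declarative" label for names without a suffix are assumptions made for this sketch.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

MODULES = {
    "KB": "knowledge base",
    "DM": "domain model",
    "UM": "user model",
    "TM": "task manager",
    "SM": "smalltalk",
}

# "declarative" for names without a suffix is an assumption of this sketch.
MOODS = {"!": "command", "?": "question", "": "declarative"}

# [!] [Var=] MODULE:NAME[!|?](arg, arg, ...)
CALL_RE = re.compile(r"^(!?)(?:(\w+)=)?([A-Z]+):([A-Z_]+)([!?]?)\(([^)]*)\)$")


@dataclass
class Call:
    negated: bool
    binding: Optional[str]  # e.g. "P" in P=KB:PICKUP(Object)
    module: str
    name: str
    mood: str
    args: List[str]


def parse_annotation(text: str) -> List[Call]:
    """Parse a conjunction such as 'KB:HOW_MANY?(ObjectB) && !KB:CONTAIN(ObjectA, ObjectB)'."""
    calls = []
    for part in text.split("&&"):
        m = CALL_RE.match(part.strip())
        if not m:
            raise ValueError(f"cannot parse: {part.strip()!r}")
        neg, binding, module, name, suffix, argstr = m.groups()
        if module not in MODULES:
            raise ValueError(f"unknown module: {module}")
        args = [a.strip() for a in argstr.split(",") if a.strip()]
        calls.append(Call(neg == "!", binding, module, name, MOODS[suffix], args))
    return calls
```

With the module known, each parsed call could then be dispatched to the KB, DM, UM, TM or SM component.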

The KB (knowledge base) intents:
  • PICKUP!(Object): bool
  • STACKUP!(Object, Object): bool
  • PUTON!(Object, Location) / PUTIN!(Object, Location): bool
    for these intents, different procedures are activated that include a physical act that takes time. The system responds with an acknowledgement or a statement of impossibility
  • BUILD!(Object): bool
    build a compound object
  • FIND?(Object): Object
    search the space for an object that matches the specifications. The object is used by another intent
  • PRESENT?(Object): bool
    like FIND?, but the system responds with yes or no
  • WHAT?(Object): Object
    like FIND?, but the system responds with a description that distinguishes the object from others
  • WHICH?(Object): Object
    like WHAT?, but comes with a specific class of objects
  • HOW_MANY?(Object): number
    this intent involves counting objects, whereas the others deal with a single instance. It returns a number
  • SUPPORT?(Object): bool
    just checks if a certain declaration exists. There could be many intents like this, and perhaps they can be combined into a single intent PRESENT?(Predication)
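The KB question intents can share one search and differ only in how they answer. This hypothetical sketch uses an invented scene: FIND? returns the matching objects, PRESENT? and HOW_MANY? reuse it, and WHAT? picks the shortest description, from a fixed list of attribute combinations (an assumption of this sketch), that singles out one object.

```python
from typing import Dict, List

# Invented toy scene, not SHRDLU's: object id -> properties.
SCENE: Dict[str, Dict[str, str]] = {
    "b1": {"size": "large", "color": "red", "type": "block"},
    "b2": {"size": "small", "color": "red", "type": "block"},
    "c1": {"size": "large", "color": "green", "type": "cube"},
    "p1": {"size": "small", "color": "blue", "type": "pyramid"},
}

# Attribute combinations tried by WHAT?, from least to most specific.
DESCRIPTION_KEYS = (["type"], ["color", "type"], ["size", "color", "type"])


def find(spec: Dict[str, str]) -> List[str]:
    """FIND?: all objects whose properties match the specification."""
    return [o for o, props in SCENE.items()
            if all(props.get(k) == v for k, v in spec.items())]


def present(spec: Dict[str, str]) -> str:
    """PRESENT?: the same search, answered with yes or no."""
    return "yes" if find(spec) else "no"


def how_many(spec: Dict[str, str]) -> int:
    """HOW_MANY?: the same search, answered with a count."""
    return len(find(spec))


def what(obj: str) -> str:
    """WHAT?: describe obj with the fewest attributes that distinguish it."""
    props = SCENE[obj]
    for keys in DESCRIPTION_KEYS:
        if find({k: props[k] for k in keys}) == [obj]:
            return "the " + " ".join(props[k] for k in keys)
    # No combination is unique: fall back to the most specific description.
    return "the " + " ".join(props[k] for k in DESCRIPTION_KEYS[-1])
```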
The DM (domain model) intents:
  • CAN?(Predication)
    checks key points of the predication with a set of allowed interactions in the domain model
  • DEFINE!(Word, Object)
    maps a word onto a semantic structure that defines it
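A hypothetical sketch of the two DM intents: CAN? looks up the key points of a predication (here: predicate, subject type, object type) in a table of allowed interactions, and DEFINE! stores a word together with the semantic structure that defines it. The table entries and the structure format are invented for illustration.

```python
from typing import Any, Dict, Set, Tuple

# Invented domain model: (predicate, subject type, object type) combinations
# that are possible in this world.
ALLOWED: Set[Tuple[str, str, str]] = {
    ("PICKUP", "robot", "block"),
    ("PICKUP", "robot", "pyramid"),
    ("SUPPORT", "block", "block"),
    ("SUPPORT", "block", "pyramid"),
    ("SUPPORT", "table", "block"),
}


def can(predicate: str, subject_type: str, object_type: str) -> bool:
    """CAN?(P): check the key points of the predication against the domain model."""
    return (predicate, subject_type, object_type) in ALLOWED


# DEFINE!(Word, Object): map a new word onto a semantic structure.
LEXICON: Dict[str, Dict[str, Any]] = {}


def define(word: str, structure: Dict[str, Any]) -> None:
    LEXICON[word] = structure


define("steeple", {
    "type": "stack",
    "contains": [
        {"type": "cube", "color": "green", "count": 2},
        {"type": "pyramid", "count": 1},
    ],
})
```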
The UM (user model) intents:
  • OWN(Person, Object): bool
    mark the ownership relation between the user and the object
  • OWN?(Person, Object): bool
    checks if the ownership relation exists
The TM (task manager) intents:
  • TOUCH?(Person, Object)
    touch is an abstract verb that includes pick up, stack on, etc. It has a time frame.
  • PICKEDUP?(Object)
    checks if PICKUP was used
  • WHEN?(Event)
    returns a time index. It is described to the user as the event that happened then
  • WHY?(Event)
    returns the active goal of an event
  • WHY?(Goal)
    returns the parent goal of a goal. The top-level goal is "because you asked me to"
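The WHY? chain can be sketched as a goal tree: every event remembers the goal that was active when it happened, and every goal remembers its parent goal. In this hypothetical sketch the goal descriptions are made up in the spirit of the dialog, not copied from SHRDLU's output.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Goal:
    description: str
    parent: Optional["Goal"] = None  # None for a goal the user asked for directly


@dataclass
class Event:
    description: str
    goal: Goal  # the goal that was active when the event happened


def why(x: Union[Event, Goal]) -> Union[Goal, str]:
    """WHY?(Event) returns the active goal; WHY?(Goal) returns the parent goal."""
    if isinstance(x, Event):
        return x.goal
    if x.parent is None:
        return "because you asked me to"
    return x.parent


# A made-up goal tree for a request like sentence 17.
stack = Goal("stack up the red blocks and a green cube")
put = Goal("put a red block on the green cube", parent=stack)
clear = Goal("clear off the red block", parent=put)
moved = Event("pick up the pyramid", goal=clear)
```

Asking "why?" repeatedly walks up the tree: `why(moved)` gives `clear`, `why(clear)` gives `put`, `why(put)` gives `stack`, and `why(stack)` gives "because you asked me to".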
The SM (smalltalk) intents:
  • THANK_YOU()
    there is no meaning involved, just a canned response

Closing

I know I sound very deliberate in this post, as if I know exactly what's going on and how it should be. That's just my way of trying to make sense of the world. In fact, I just stumbled on this approach to NLI and it's very new to me. Also, my analysis of SHRDLU's "intents", which it never had, may be completely off track. One would only know by trying to build a new SHRDLU around this concept.

I think chatbots form a great way to get involved in natural language processing.
