Over two years ago I wrote Thoughts On Voice Interfaces, a scattering of ideas about voice interactions. Time has passed, I’ve been working more on voice interfaces, and though products in the market have barely changed, I still have NEW THOUGHTS.
Large Language Models
I’ve only been exploring large language models for a couple of months, and only with GPT-3, but most of this post will be about LLMs.
I’ve included examples of prompts and outputs in this section. Click ⊕ Example to see the examples, all screenshots of the GPT Playground.
- With the addition of Large Language Models (LLMs), especially GPT-3, I believe everything will change. But not quickly: there’s a huge amount of implementation and experimentation and finding and fixing of problems ahead.
- Existing systems are based on intent categorization and entity extraction:
- You have a bunch of commands, the commands take certain kinds of arguments (song title, time span, search query, etc).
- Any utterance is compared to examples and categorized as a command, and the arguments (“entities”) are extracted.
- This isn’t just an implementation detail; it also forms the basic shape of voice interactions. Voice interactions work the way they do because this is the technology we have (until now). This command-based model isn’t begging for LLMs, though an LLM could probably increase its accuracy.
- If you plug voice into a GPT-3-driven interaction you’ll find it’s very sensitive to speech recognition inaccuracies.
- The traditional categorization model naturally ignores many inaccuracies because it’s just trying to find the best match among a finite set of commands and examples.
- That is: the traditional model tries very hard to keep the conversation on the rails, and one inaccuracy won’t make you jump to another rail.
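  To make that command-based shape concrete, here’s a toy sketch; the intents, examples, and scoring are invented, and real systems use trained classifiers, but the shape is the same. Note how a misrecognition like “purple reign” still lands on the right rail:

  ```python
  import re

  # Toy intent definitions: example phrasings plus the entities each command expects.
  # Everything here is invented for illustration.
  INTENTS = {
      "play_music": {
          "examples": ["play some music", "play {song}", "put on {song}"],
          "entities": ["song"],
      },
      "set_timer": {
          "examples": ["set a timer for {duration}", "timer {duration}"],
          "entities": ["duration"],
      },
  }

  def classify(utterance: str):
      """Pick the best-matching intent by naive word overlap."""
      best_intent, best_score = None, 0.0
      words = set(utterance.lower().split())
      for name, intent in INTENTS.items():
          for example in intent["examples"]:
              template_words = set(re.sub(r"\{\w+\}", "", example).split())
              score = len(words & template_words) / (len(template_words) or 1)
              if score > best_score:
                  best_intent, best_score = name, score
      # A real system would now extract the entities ("Purple Rain", "ten minutes")
      # from the leftover words.
      return best_intent, best_score

  print(classify("play Purple Rain"))             # ('play_music', 1.0)
  print(classify("set a timer for ten minutes"))  # ('set_timer', 1.0)
  print(classify("play purple reign"))            # still 'play_music', still on the rails
  ```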
- The beauty of an LLM is that we might not need rails, that we can make vastly more capable agents that can operate on a much wider set of input.
- The danger of LLMs is that they take things very literally, and are very credulous. Ask it what’s next on your caliper and it will try very hard to come up with a creative answer instead of figuring out you were asking about your calendar.
- I’m sure this isn’t a fatal flaw, but it’s a challenge right now.
- LLMs are stereotyped as text generators, but they are also great at understanding. Two years ago I complained there’s no “understanding” in “natural language understanding”. That will change.
- We still have to remind ourselves that what it’s doing is not “understanding” as we know it; when an LLM responds to something there isn’t an entity underneath that is understanding things.
- The understanding happens when we fuse a person’s input with meaningful output or action; when we ground the model in some purpose.
- Prompt Engineering is an important part of that fusion: adding context to the input and interpreting the output. These shells we create are an embodiment of AI, what turns a mechanism (an LLM) into a meaningful entity.
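  As a sketch of what one of those shells looks like; the prompt format, the ACTION/SAY convention, and the complete() helper are all invented for illustration:

  ```python
  def complete(prompt: str) -> str:
      """Stand-in for whatever completion endpoint you're using."""
      raise NotImplementedError("wire this up to your LLM of choice")

  def handle_utterance(utterance: str, context: dict) -> tuple[str, str]:
      # Adding context to the input...
      prompt = (
          f"You are a voice assistant. Today is {context['date']}. "
          f"The user is at {context['location']}.\n"
          "Reply with one line, either ACTION: <thing to do> or SAY: <thing to say>.\n\n"
          f"User: {utterance}\n"
          "Assistant:"
      )
      output = complete(prompt).strip()
      # ...and interpreting the output is the other half of the shell.
      if output.startswith("ACTION:"):
          return "action", output[len("ACTION:"):].strip()
      return "say", output.removeprefix("SAY:").strip()
  ```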
- Some understanding will be inference: being able to contextualize and make reasonable assumptions about what a person means. This is what people pay a lot of attention to when talking about “understanding,” though I think it’s over-emphasized. Example
- Much of understanding will be translation: taking our speech and translating it into an actionable form.
- We see this currently in GPT’s code generation. You can ask GPT for a concrete answer, but it’s also very good at writing down the steps by which you can get an answer. Example
- “The steps by which you can get an answer” is another way of saying “a program”. We will be programming our agents and getting our agents to generate programs. Example
- You won’t see the programs of course. Mostly. GPT is also pretty good at turning code into natural language. I expect this round-tripping will be useful in debugging our agents, something we’ll probably have to do individually from time to time. The same way we ask another person to repeat our instructions to make sure we’re clear, we may ask our AI agents to do the same. Example
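  A sketch of the utterance-to-program idea, assuming a hypothetical complete() wrapper and a step vocabulary made up for this example:

  ```python
  def complete(prompt: str) -> str:
      raise NotImplementedError  # stand-in for your LLM completion call

  # Ask the model not for the answer itself but for the steps by which you can
  # get an answer: a tiny program in an invented vocabulary.
  PROMPT = """\
  Translate the request into numbered steps, using only these operations:
  SEARCH(query), FILTER(condition), READ_ALOUD(text).

  Request: {utterance}
  Steps:
  """

  def plan(utterance: str) -> list[str]:
      output = complete(PROMPT.format(utterance=utterance))
      return [line.strip() for line in output.splitlines() if line.strip()]

  # plan("what did my sister email me about the cabin last week?") might come
  # back as SEARCH/FILTER/READ_ALOUD steps: something we can execute, and also
  # ask the model to turn back into plain language when we need to debug it.
  ```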
- What’s the programming language that our personal agents will be generating based on our commands? Probably something new, probably accidental, maybe the product of implicit negotiations between developers and the large language models, each of which is going to try to satisfy the other.
- The LLM “prompt” is more than just a command for the LLM. It’s instruction and goal-setting and things we haven’t yet discovered. But it’s especially context.
- We haven’t been providing this context to intent parsers or to speech recognition itself. What context we have been giving is mostly internal implementation details, not accessible to language understanding. That will have to change.
- Consider a case like sending a message to a contact: this often involves the difficult vocabulary of people’s names. We may solve this generally by personalizing models so they work better with individualized information like your list of contacts. But with an LLM we can include that information in the prompt or context, much more casually and contextually than most personalization. Example
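  As a sketch (the contact names and the complete() helper are invented):

  ```python
  def complete(prompt: str) -> str:
      raise NotImplementedError  # stand-in for your LLM completion call

  CONTACTS = ["Saoirse Byrne", "Joaquín Delgado", "Ngozi Okafor"]  # invented names

  def resolve_message(transcript: str) -> str:
      # The recognizer will mangle unusual names; putting the contact list right
      # in the prompt lets the model map the mangled text back to a real contact.
      prompt = (
          "The user's contacts are: " + ", ".join(CONTACTS) + "\n"
          f'The speech recognizer heard: "{transcript}"\n'
          "Who is the message for, and what does it say?\n"
          "Answer as: CONTACT | MESSAGE\n"
      )
      return complete(prompt)

  # resolve_message("send a message to searsha burn saying I'm running late")
  # should come back as "Saoirse Byrne | I'm running late"
  ```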
- This represents an alternative to how we usually think about “learning” in voice interfaces, which is about frequency and probabilities. There are a lot of ideas about how to predict what the user will do and then bias the result based on that, or change the interface to make the likely thing easier, or even to do something proactively for the user. This is not that!
- Armed with many predictive capabilities I think we’ve been projecting prediction onto human-to-human interactions where it doesn’t exist. Your favorite barista does not actually make the coffee before you arrive. And even if they ask “the regular?” they are picking up the thread of past interactions and past conversations, not just predicting what you’ll have. Prediction removes the autonomy and self-direction of the user. If there’s a predictive analog in human-to-human interaction it’s probably in how we stereotype each other and our respective roles, not in how we communicate.
- LLMs love context. In your conversation you’re building a transcript. It doesn’t stop you from wandering, but it’s always ready to build off that previous context.
- We’ll have to figure out when to forget and when to erase context. GPT has a fixation problem where it will sometimes persistently attend to an unimportant detail in a conversation. Example. And as we support longer conversations and relationships that happen over days and months, we will have to think about how to keep the model’s focus aligned with the user’s mental model and focus.
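  A rough sketch of the mechanics: accumulate a transcript, and “forget” by trimming older turns. The word-count budget is a crude stand-in for token counting, and a real system might summarize old turns instead of dropping them:

  ```python
  class Transcript:
      def __init__(self, budget_words: int = 1500):
          self.turns: list[str] = []
          self.budget_words = budget_words

      def add(self, speaker: str, text: str) -> None:
          self.turns.append(f"{speaker}: {text}")

      def as_prompt(self) -> str:
          # Keep the most recent turns that fit the budget; older turns are "forgotten".
          kept, total = [], 0
          for turn in reversed(self.turns):
              total += len(turn.split())
              if total > self.budget_words:
                  break
              kept.append(turn)
          return "\n".join(reversed(kept))
  ```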
- Follow-up and refining commands will be much easier with LLMs. Maybe everything will feel a little like a chatbot, though for functional and not personality reasons. People will become used to refinements that we’ve been self-censoring because we don’t believe they are possible. Example
- A mind-bending part of building a chat interface directly on GPT-3 is that you have to create archetypes and entities from scratch.
- There is no “you” and “me”. “You” is a character that has to be introduced and described to the LLM. Armed with a description the LLM will then predict what this entity might do. The LLM doesn’t “pretend” to be “you” as there is no underlying personality. And “me” (the human) is again an introduced entity. Without constraints the LLM will gladly predict both sides of the conversation. Example
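  A sketch of how these characters get written down; the complete() helper is a stand-in, “Aria” is an invented name, and the cut at “Human:” is what keeps the model from writing both sides of the conversation:

  ```python
  def complete(prompt: str) -> str:
      raise NotImplementedError  # stand-in; most APIs also take stop sequences directly

  # "Aria" and "Human" are characters we invent and describe. There is no "you"
  # until we write one down; the model just predicts what such a character says.
  PERSONA = (
      "The following is a conversation between Human and Aria, a patient voice "
      "assistant that answers briefly and asks for clarification when unsure.\n"
  )

  def assistant_reply(history: list[tuple[str, str]], user_text: str) -> str:
      lines = [PERSONA]
      for speaker, text in history:
          lines.append(f"{speaker}: {text}")
      lines.append(f"Human: {user_text}")
      lines.append("Aria:")
      output = complete("\n".join(lines))
      # Left unconstrained the model will happily write the Human's next line
      # too, so cut the completion off at the first "Human:".
      return output.split("Human:")[0].strip()
  ```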
- Maybe there’s something cool that could be done by making it a conversation between me and me: that is, modeling the interaction on internal dialog. Example 1 Example 2
- It’s a fun party trick to make GPT-3 say things in the style of The Bible or Jar Jar Binks, but it’s more than that…
- Asking GPT to write something for a kindergartener is great. Maybe that should even be the standard starting point instead of “normal” output. Example
- Asking it to write for speech output could work, though I haven’t had much success in my brief experiments. I don’t know what to ask for, and I’m not sure what “good for speech output” really means. Some people have had good success with SSML generation.
- Adjustments to the output are very amenable to user override: changing the tone, making it more brief, emphasizing certain points, etc. These kinds of adjustments can be made without introducing combinatorial complexity. Example
- Given the right chat context for an individual, and enough quantity, I’m guessing an LLM may naturally adopt the user’s phrasing and language style. Though I wonder if people naturally code switch to speak in a “formal” style with voice interfaces. (“Naturally” may be learned behavior to avoid speech recognition errors, or it might also be based on the mental model we have about our relationship with the AI agent.)
- I’m optimistic about the novel way you can compose text in ChatGPT, by acting more as an editor than an author, and how this will be applicable to voice…
- If ChatGPT writes something for you and you want to change it, the best way is usually to ask for the change. “Make it shorter,” or “take out the part about Jar Jar Binks.” Example
- This is also a good way to edit with your voice. Referring to specific sentences or paragraphs or describing mechanical changes is very hard to do with voice. It’s hard to compose those edits in your head, and it’s hard to say them accurately. GPT is quite good at editing text that is human generated. Example
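  The editing loop is easy to sketch, again assuming a hypothetical complete() wrapper: keep the current draft, and treat each spoken instruction as a revision request:

  ```python
  def complete(prompt: str) -> str:
      raise NotImplementedError  # stand-in for your LLM completion call

  def revise(draft: str, instruction: str) -> str:
      # Spoken edits are descriptions of changes ("make it shorter", "take out
      # the part about Jar Jar Binks"), not character-level mechanical edits.
      prompt = (
          "Here is a draft:\n\n" + draft + "\n\n"
          "Revise it according to this instruction: " + instruction + "\n\n"
          "Revised draft:\n"
      )
      return complete(prompt).strip()

  # A dictation session becomes a loop: speak an instruction, revise, show or
  # read back the new draft, wait for the next instruction.
  ```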
- I think there are entirely new patterns of interaction possible with LLMs…
- Instead of commanding, maybe you speak to your AI agent freely; the emphasis is just to dump a lot of stuff into it. Then you start using that accumulated information as the basis for action. Example
- Maybe it’s less transactional, less modal. You could be asking it to do something and then instead of pausing and waiting for it to do the thing you could just… keep going. The AI starts to assemble a plan for complicated actions, while still immediately acting on things that are easy and low-impact, though also respecting any higher-level instructions like ordering or conditionals. Example
- GPT and probably all LLMs are fairly expensive, more expensive than any of the current techniques.
- Hopefully this stuff gets cheaper, not because compute gets much cheaper but because smart people just figure things out. But there may be a hard lower bound on the cost.
- It’s possible to select different models and approaches in different contexts, most of them more efficient; I expect very eclectic stacks to emerge.
- When we start getting Large Language Models built on predicting text AND behavior I think we’ll see another burst of capability.
More Voice Thoughts
So obviously I’m personally pretty focused on the effect of LLMs, but there’s more to be said…
- Endpointing remains a major issue. That is, deciding when the person is “done” speaking and then acting on that speech.
- The idea of “done” itself imposes very specific ideas on how an interaction works.
- This whole “done” thing is imposed. Every system that uses silence-based endpointing is also capable of listening a little longer, with no real privacy issue. It feels a bit like a “lalala I’m not listening to you” situation, as though it’s better to just not know that the user is continuing to speak.
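  For what it’s worth, the whole mechanism is roughly this simple; a toy sketch where is_speech() stands in for a real voice activity detector and the numbers are arbitrary:

  ```python
  import time

  # A toy silence-based endpointer: declare the utterance "done" after a
  # stretch of silence.

  SILENCE_TIMEOUT = 0.8   # seconds of silence before we decide the user is "done"
  FRAME_DURATION = 0.03   # seconds per audio frame

  def listen(is_speech) -> list[bool]:
      frames, silent_for, heard_any = [], 0.0, False
      while True:
          speaking = is_speech()   # one VAD decision per frame
          frames.append(speaking)
          heard_any = heard_any or speaking
          silent_for = 0.0 if speaking else silent_for + FRAME_DURATION
          if heard_any and silent_for >= SILENCE_TIMEOUT:
              # Nothing stops us from continuing to listen here; "done" is a
              # choice the endpointer makes, not a fact about the conversation.
              return frames
          time.sleep(FRAME_DURATION)
  ```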
- TTS output does make it difficult to leave the mic open. With the right physical device setup I assume it would be possible to quickly stop TTS when the original speaker continues to speak. But thoroughly designed, vertically integrated physical devices are not the norm, so neither is this pattern.
- I feel a sense of both optimism and disappointment about a well-integrated physical device; that is, a device where the microphone, speaker, physical controls, other input modalities (like an IMU), speech recognition, speech understanding, command execution, and output systems are all integrated for a more ideal experience. The potential is there at every level, but hardware like this requires capital and the total experience requires vision. Will we have to wait for a big player to discover the vision, or for the hardware design to become more accessible for a new entrant? (I wish Rayban Stories were a more open environment!)
- Humans have many forms of non-verbal or at least non-speech ways to indicate a desire to continue or interrupt. This is also a point of frequent errors, and requires reading other people in a way that many people find difficult. Still I think there’s more room to mimic some of these patterns: the “uh” call to attention, the stare-at-the-sky please-wait gesture, the facial expression of pausing mid-sentence.
- Invocation remains a barrier to fluid voice interactions.
- Wake words and other voice invocations like special hardware gestures are only accessible when you have control of the OS.
- There aren’t many tools to prototype wake words or keyword spotting. This also gets in the way of using procedure words to control things like mic state, or to make markers in your voice transcript that correspond to other events. I still like the idea of procedure words, but I’ve had poor luck actually making them work.
- I’m divided between purely wake-word-based initiation and hardware buttons or other ways to initiate. Using a wake word is more socially appropriate because it informs any bystanders what is happening. And if you are going to talk, why not start it by talking? But people – sometimes including myself – seem to like other controls, and they open up some possibility for multitasking with voice. If you are in a call you don’t want to use a wake word because the word will end up in the call… the system can mute the call once the wake word is recognized, but you’ve already said something weird.
- Speech recognition systems aren’t just “better” and “worse”; individual systems have different behavior.
- Though it’s seldom used, I like the idea of using spelling to handle difficult-to-recognize words. That is, you might say “tee oh oh” if you want to force the system to write “too” instead of “to”. But support for this is spotty and inconsistent, and systems don’t reliably distinguish between explicit input like this (or spoken punctuation) and ordinary implied text.
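  The shape of the idea, as a toy sketch (a real version would have to live inside the recognizer rather than on top of its text output, and the letter-name spellings here are just guesses at what a recognizer might emit):

  ```python
  # Map spoken letter names to characters for explicit spelling.
  LETTER_NAMES = {
      "ay": "a", "bee": "b", "see": "c", "dee": "d", "ee": "e", "eff": "f",
      "gee": "g", "aitch": "h", "eye": "i", "jay": "j", "kay": "k", "ell": "l",
      "em": "m", "en": "n", "oh": "o", "pee": "p", "cue": "q", "are": "r",
      "ess": "s", "tee": "t", "you": "u", "vee": "v", "double-you": "w",
      "ex": "x", "why": "y", "zee": "z",
  }

  def spell_out(spoken: str) -> str:
      """'tee oh oh' -> 'too'; words that aren't letter names pass through unchanged."""
      return "".join(LETTER_NAMES.get(word, word) for word in spoken.lower().split())

  print(spell_out("tee oh oh"))  # too
  ```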
- Punctuation is also handled differently: some systems mostly emit just words, others try to punctuate everything. And somehow even spoken punctuation (e.g., saying “period” for “.”) is applied inconsistently within a single system.
- Whisper is an impressive speech recognizer but it’s not real time and not easy to apply to interactions.
- Right now speech recognition systems usually have partial and final transcriptions. We’ve all seen the partial recognition change as we continue to speak and earlier words are corrected given later context. These represent two levels of recognition. Whisper uses an even wider window of analysis to identify a sensible sentence given the input. Maybe we could have three levels: very fast partial transcriptions, ~1 second transcriptions, and then a ~4+ second “this time it’s really final” transcript? But seeing the transcript update even further back may be confusing or distracting for the user, and it may be hard to trust that errors in the not-quite-final transcription will get fixed with a little patience. And the programming is already hard to get right with just two versions of the transcript.
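  A sketch of what juggling those levels looks like on the client side; the event shape here is invented, since every recognizer has its own:

  ```python
  from dataclasses import dataclass

  @dataclass
  class TranscriptEvent:
      text: str
      level: str  # "partial" | "final" | "revised"

  class TranscriptView:
      def __init__(self):
          self.committed = ""   # text we treat as settled
          self.pending = ""     # text that may still change

      def apply(self, event: TranscriptEvent) -> str:
          if event.level == "partial":
              self.pending = event.text
          elif event.level == "final":
              self.committed += event.text + " "
              self.pending = ""
          elif event.level == "revised":
              # The hard part: updating text the user already saw as "final".
              self.committed = event.text + " "
          return (self.committed + self.pending).strip()
  ```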
- Whisper makes retraining seem more accessible. Are there alternatives to current microphones that get enough information to create a transcription, even if they don’t get enough information to build an audio recording that other people could recognize? Like a microphone in earbuds, embedded in eyeglass temples, or something touching the nose bridge or neck… and instead of using tricks to make those microphones sound “normal,” train directly on the raw data they produce.
- There’s room for more voice consumption that doesn’t have any output at all, or doesn’t immediately do anything.
- That isn’t just voice recorders. There are tools like Otter or Google Recorder that faithfully listen, but mostly just transcribe.
- I’ve imagined a process for recording family photos and documents with voice and photos, so that your photos are directly connected with those stories.
- Maybe a generalization: you can fuse voice with other input without real-time feedback, as long as you have a record of that other input, so that you can fix mistakes later instead of relying on feedback to fix them during the interaction.
- I’m excited about the possibility for asymmetric I/O: speaking into the AI, but presenting the results visually.
- This would solve some of the all-or-nothing issue with intent parsing. It’s much easier to live-update something visual, so the computer can show its best-effort interpretation at any time.
- Screens represent a conversation between the human and the computer. The computer is showing what is possible (buttons and fields), what it knows (informational elements), and the status of different inputs (cursors, dialog boxes). The standard GUI exchange with a computer is more like whiteboarding than a conversation.
- Just doing voice control of our screens hasn’t caught on. But these traditional GUI elements – buttons, etc. – are specific to non-voice inputs. What would a GUI designed for voice input look like? I genuinely don’t know!
- When framed as an accessibility technology, voice will be caught in this not-designed-for-voice trap. “Accessibility” implies “a small group of people,” “not your real audience,” and “applicable to status quo software.”