What happens when you combine the mind of a Nao robot with a Teddy bear?
Well, you get this:
Teddy is using a Raspberry Pi board, a ReSpeaker 4-microphone array, a MaxBotix sonar and an infrared distance sensor. The object recognition, speech and conversational abilities are provided by Google Cloud and Azure Text-to-Speech (for Romanian).
Why?
Nao is really cute, but 9,000 USD apiece might be a little too much. We aimed to get the same smart features (no animatronics, yet) for as little as 90 USD for the light version, or 180 USD for the full version with the sonar and infrared sensors.
What?
More precisely, Teddy is able to:
- react when an object is nearby
- recognise the object and speak out what it sees
- react to voice commands
- sustain a conversation
How?
React when an object is nearby
The light version (the one we initially started with) uses only a Raspberry Pi, a Raspberry Pi camera and a USB microphone. Total cost for the hardware is less than 90 USD. It works pretty well, but there is room for improvement. One of the challenges for the light version is detecting when an object is nearby and triggering the recognition process. Having only the camera available, we compute changes between video frames and trigger the recognition process when things are moving. This works well enough, but there are some edge cases where it fails (such as lots of people moving around the room, or slowly moving the object in front of the camera).
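For reference, the frame-differencing trigger can be sketched roughly like this with OpenCV; the pixel-count threshold and the trigger_recognition hook are illustrative placeholders, not the exact values from our code:

```python
import cv2

MOTION_THRESHOLD = 50_000  # changed pixels that count as "something moved" (illustrative)

cap = cv2.VideoCapture(0)  # the Pi camera exposed as a V4L2 device
ok, frame = cap.read()
previous = cv2.GaussianBlur(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY), (21, 21), 0)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.GaussianBlur(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY), (21, 21), 0)
    diff = cv2.absdiff(previous, gray)                    # per-pixel change vs last frame
    mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)[1]
    if cv2.countNonZero(mask) > MOTION_THRESHOLD:
        trigger_recognition(frame)  # hypothetical hook into the recognition step
    previous = gray
```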
To alleviate this, we added a MaxBotix sonar which measures the distance to the first object in front of the bear. When an object is close enough, the recognition process is triggered.
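A sketch of the sonar trigger, assuming a MaxBotix model that reports readings over its TTL serial output in the "Rxxx\r" format and is wired to the Pi's UART; the 60 cm threshold is illustrative:

```python
import serial

NEAR_CM = 60  # "close enough" threshold (illustrative)

with serial.Serial("/dev/serial0", 9600, timeout=1) as port:
    while True:
        reading = port.read_until(b"\r").decode(errors="ignore").strip()
        if reading.startswith("R") and reading[1:].isdigit():
            distance = int(reading[1:])  # many MaxBotix models report cm or inches
            if distance < NEAR_CM:
                trigger_recognition()    # hypothetical hook into the recognition step
```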
Recognise the object and speak out what it sees
This is actually the easiest part. It's as easy as taking a photo with the camera, POSTing it to vision.googleapis.com, receiving the JSON with the objects in the picture, sending the JSON to translation.googleapis.com to get it translated into the language we want, and one final POST to texttospeech.googleapis.com (or the Azure cloud TTS service for Romanian) to get the MP3 file with the speech.
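Condensed, the three REST calls look roughly like this with the requests library and an API key (the key and file names are placeholders); French is used as the example target, since our Romanian speech goes through Azure instead:

```python
import base64
import requests

API_KEY = "YOUR_GOOGLE_API_KEY"  # placeholder

# 1. Ask the Vision API what is in the photo.
photo = base64.b64encode(open("snapshot.jpg", "rb").read()).decode()
vision = requests.post(
    f"https://vision.googleapis.com/v1/images:annotate?key={API_KEY}",
    json={"requests": [{"image": {"content": photo},
                        "features": [{"type": "LABEL_DETECTION", "maxResults": 3}]}]},
).json()
labels = [x["description"] for x in vision["responses"][0]["labelAnnotations"]]

# 2. Translate the labels into the target language.
translated = requests.post(
    f"https://translation.googleapis.com/language/translate/v2?key={API_KEY}",
    json={"q": ", ".join(labels), "target": "fr"},
).json()["data"]["translations"][0]["translatedText"]

# 3. Turn the text into an MP3.
speech = requests.post(
    f"https://texttospeech.googleapis.com/v1/text:synthesize?key={API_KEY}",
    json={"input": {"text": translated},
          "voice": {"languageCode": "fr-FR"},
          "audioConfig": {"audioEncoding": "MP3"}},
).json()
open("speech.mp3", "wb").write(base64.b64decode(speech["audioContent"]))
```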
React to voice commands
This was actually one of the hardest parts of the project. I needed to process everything locally, so I looked for a hot-word detection library. Porcupine was the most promising in terms of being speaker independent, but it was commercial and the price was not upfront (you need to email them and negotiate), so I chose Snowboy instead. Using it is really easy. You just provide three WAV recordings of your voice command, and the Snowboy cloud builds a model that you integrate in your device code. The model does not need to call the server to do the actual recognition; everything is offline.
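Following the pattern from Snowboy's Python demos, the detection loop is just a few lines; "hello_teddy.pmdl" stands in for the personal model downloaded from the Snowboy site:

```python
import snowboydecoder  # ships alongside Snowboy's Python examples

def on_hotword():
    print("Hot word heard, handling the command...")  # hypothetical handler

detector = snowboydecoder.HotwordDetector("hello_teddy.pmdl", sensitivity=0.5)
detector.start(detected_callback=on_hotword, sleep_time=0.03)  # blocks, polling the mic
detector.terminate()
```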
The downside is that everything is speaker-dependent. Teddy is quite good at listening to my voice commands, but will not be so compliant if someone else is speaking.
To make Teddy listen to other people, I've added an infrared distance sensor from Pololu on the side of the hat. It has a double function: when you move your hand in front of the sensor it triggers the conversation mode, and if you keep your hand there for 10 seconds, the Raspberry Pi shuts down.
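Measuring how long the hand stays in front of the sensor is enough to tell the two gestures apart. A sketch with RPi.GPIO, assuming the sensor provides a digital, active-low output; the pin number and the start_conversation hook are illustrative:

```python
import os
import time
import RPi.GPIO as GPIO

IR_PIN = 17        # illustrative BCM pin
HOLD_SECONDS = 10  # a long hold means "shut down"

GPIO.setmode(GPIO.BCM)
GPIO.setup(IR_PIN, GPIO.IN, pull_up_down=GPIO.PUD_UP)

while True:
    if GPIO.input(IR_PIN) == GPIO.LOW:          # hand detected
        start = time.time()
        while GPIO.input(IR_PIN) == GPIO.LOW:   # wait until the hand is gone
            time.sleep(0.1)
        if time.time() - start >= HOLD_SECONDS:
            os.system("sudo shutdown -h now")   # long hold: power down the Pi
        else:
            start_conversation()                # hypothetical hook: short wave
    time.sleep(0.1)
```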
Conversation
The conversation feature is based on the Google Assistant libraries. While it's really easy to add Google Assistant to a Raspberry Pi, the standard process has a limitation which I did not like: you need to start the conversation by saying "Hey Google!". I decided it's a little weird to address Teddy as "Hey Google!", so I've used Snowboy once more. The flow is as follows.
Snowboy is called in a custom Python handler and waits until "Hello Teddy" is heard. Whatever sound follows is recorded as a WAV file (until 2 seconds of silence), and the WAV is then sent to the Google Assistant cloud using the standard Google libraries. The MP3 reply is simply played on the Raspberry Pi.
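One way to wire this up (a sketch, not our exact code) is to let Snowboy fire the Google Assistant Library's start_conversation() call, which takes care of the recording and the reply playback itself; the credential path and device model id are placeholders:

```python
import json
import threading

import google.oauth2.credentials
import snowboydecoder
from google.assistant.library import Assistant

with open("/home/pi/.config/google-oauthlib-tool/credentials.json") as f:
    credentials = google.oauth2.credentials.Credentials(token=None, **json.load(f))

with Assistant(credentials, "teddy-device-model") as assistant:  # placeholder model id
    events = assistant.start()  # the assistant must be running before any conversation

    # Run the hotword detector on its own thread and let "Hello Teddy"
    # trigger a conversation instead of the built-in "Hey Google".
    detector = snowboydecoder.HotwordDetector("hello_teddy.pmdl", sensitivity=0.5)
    threading.Thread(
        target=detector.start,
        kwargs={"detected_callback": assistant.start_conversation},
        daemon=True,
    ).start()

    for event in events:
        print(event)  # log what the assistant is doing (listening, responding, ...)
```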
The code
is on my GitHub, under the MIT license; feel free to build a Teddy just for you :)
Many thanks to
- Florin Gheorghe, my partner in crime on this project
- Cegeka, who fully supported this project