It’s Black Friday. You are looking for gifts for your boomer parents. All they really want to be able to do is hear the Stones at will. So maybe get an Echo, or one of those Google Home things. Phil and Sharon love theirs: you just say “Alexa, order the Eagles remaster” and it shows up. You can check sports scores with it; you don’t even have to take out your phone.
Do not do this.
The temptation is to focus on the fact that the speaker is always listening, and that despite what they tell you initially, you have no control over who accesses the device. This is the obvious, surface level of fuckery. Here is what you can actually do with a network of smart speakers.
Smart speakers are on your home network. That means they can see traffic on your home network (although most web traffic is encrypted now, so they see only the domain you’re trying to access and certain metadata). They can also see the devices on your network – phones, laptops, routers. Those devices are now linkable with an Amazon (or whoever’s) account. That account has an address attached to it; in fact, an address history.
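To make the "only the domain" point concrete, here is a rough sketch of what a passive observer on the same network could read from a single request. It assumes the hostname leaks in the clear through DNS or the TLS handshake, which is the common case but not universal (encrypted DNS and similar measures change the picture). The function name `visible_to_network` and the example URL are made up for illustration.

```python
from urllib.parse import urlsplit


def visible_to_network(url):
    """Return the parts of a request a passive on-network observer could read.

    Assumption: for HTTPS the hostname still leaks (DNS lookup, TLS handshake),
    while the path and query string travel inside the encrypted channel.
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        return {"host": parts.hostname, "port": parts.port or 443}
    # Plain HTTP: everything is readable.
    return {
        "host": parts.hostname,
        "port": parts.port or 80,
        "path": parts.path,
        "query": parts.query,
    }


# Hypothetical request: the observer learns where you went, not what you asked.
seen = visible_to_network("https://example.com/search?q=divorce+lawyer")
print(seen)
```

The domain alone, with timing and volume, is already the "certain metadata" in question.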
You don’t actually need to be on the network in order to see devices, though – smart speakers have their own wireless hardware. So even if a phone isn’t attached to the network, it is possible in principle to see that a phone is present within Wi-Fi or Bluetooth range. You can also see the presence of nearby networks – your neighbors’ routers.
Inside the room, your smart speaker has a stereo microphone and stereo speakers. This is pretty cool – it gives the device sonar. This lets you map the room, and detect objects, their rough dimensions, and their movement across the room. Gait analysis is in scope, in principle, which gives you a weak biometric ID and rough estimates of age, weight, and sex. Given a stereo microphone, separating individual voices is also pretty much a solved problem.
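The sonar idea boils down to echo timing: play a known sound, find where it shows up again in the recording, and convert the delay to distance. Below is a minimal sketch on purely synthetic data, not anything a real device is known to run. The ping shape, the sample rate, the 560-sample delay, and the function names are all assumptions chosen for illustration, and the speed of sound is the usual room-temperature approximation.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # approximate value for air at room temperature


def best_lag(reference, recording):
    """Return the offset (in samples) where `reference` best matches `recording`."""
    best, best_score = 0, float("-inf")
    for lag in range(len(recording) - len(reference) + 1):
        # Sliding dot product, i.e. a brute-force cross-correlation.
        score = sum(r * recording[lag + i] for i, r in enumerate(reference))
        if score > best_score:
            best, best_score = lag, score
    return best


def echo_distance_m(lag_samples, sample_rate_hz):
    """Convert a round-trip delay in samples to a one-way distance in metres."""
    round_trip_s = lag_samples / sample_rate_hz
    return SPEED_OF_SOUND_M_S * round_trip_s / 2


# Synthetic demo: a short chirp-like ping and a silent recording
# containing one attenuated echo of it at a known delay.
SAMPLE_RATE_HZ = 48_000
TRUE_DELAY = 560  # samples, hypothetical
ping = [math.sin(2 * math.pi * n * n / 1280) for n in range(64)]
recording = [0.0] * 2000
for i, sample in enumerate(ping):
    recording[TRUE_DELAY + i] += 0.3 * sample

lag = best_lag(ping, recording)
distance = echo_distance_m(lag, SAMPLE_RATE_HZ)
print(lag, round(distance, 2))
```

A real room is noisier and produces many overlapping reflections, but two microphones and repeated pings are what turn single distances like this into a rough map.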
So at the end of this you have voice samples, linked to rough bodily dimensions and other basic biometrics, and the devices present on that body, along with the knowledge that all of this is linked to other such data points at a particular location, which is physically proximate to other particular locations. You can see in real time if anyone is home, and if they’re asleep yet. You can map friendships, roommates, families, and breakups, who talks with whom, who’s started hitting the gym, who came home with a limp.
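Turning sightings like these into a graph takes very little machinery. The sketch below counts how often two devices turn up at the same place in the same hour. The sightings are invented, and the labels (`house_a`, `phone_1`, and so on) and the function name `co_presence` are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sightings: (hour, location label, device label).
sightings = [
    (20, "house_a", "phone_1"),
    (20, "house_a", "phone_2"),
    (21, "house_a", "phone_1"),
    (21, "house_a", "phone_2"),
    (21, "house_b", "phone_3"),
    (22, "house_b", "phone_3"),
    (22, "house_b", "phone_1"),
]


def co_presence(sightings):
    """Count how often each pair of devices shares a place in the same hour."""
    present = {}
    for hour, place, device in sightings:
        present.setdefault((hour, place), set()).add(device)
    edges = Counter()
    for devices in present.values():
        # Every pair seen together adds one unit of weight to that edge.
        for pair in combinations(sorted(devices), 2):
            edges[pair] += 1
    return edges


graph = co_presence(sightings)
for pair, weight in sorted(graph.items()):
    print(pair, weight)
```

Edge weights that rise, fall, or move from one house to another over the weeks are the friendships, roommates, and breakups.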
We haven’t even gotten to the content layer yet, done a single active attack, or transcribed a single recording, and we have a social graph to make the Stasi blush.
But let’s go a bit deeper. If I know the devices on a network, I know some of them have vulnerabilities, some public and some nonpublic. If I’m willing to go a bit active, I can infect these devices, and end up with a pretty substantial group of devices under my control. Now I have persistence even if you trash the speaker.
“This sounds scary, but it’s OK”, you say, “I trust Iron Jeff; he’s not going to cause the kind of PR damage this would entail to his own company”. Excellent; Jeff isn’t the guy you need to worry about, per se (although you should worry about him too). Do you know how many, to pick an example, Chinese citizens work in Silicon Valley? How many of them have the ability to ship Alexa code, or exfiltrate code to analyze for vulnerabilities? How many people, precisely, have access to code signing keys that let firmware be shipped on the sly? How many people are vulnerable to blackmail over visa status, or via family connections back in the old country? How many employees over the past month or so were politely requested at a border to log into their corporate accounts while leaving or entering the country? It’s not actually clear that anyone at any of the companies in question keeps track of this; for HR purposes alone it’s probably not a good idea to be the maintainer of “Plausible red Chinese agents forbidden from accessing security projects – final updated revised.docx”.
“I don’t care if Alexa hears me taking a dump and watching The Game”, the boomers say. “I’m not talking about anything important anyway”. The thing is that the product isn’t just your data, it’s the network. Every additional speaker increases the risk to everyone who happens to be mapped by the network, not just the person who happens to own the device – a perverse extension of Metcalfe’s Law. We kill people based on metadata, after all. Maybe just get your parents some vinyl instead.