Q&A: How video helps build robot brains for physical AI
Robots could well be the next trillion-dollar tech opportunity, in no small part thanks to AI. Not surprisingly, that’s led to a race by a variety of robotics companies to build industrial and humanoid robots to help (or replace) humans.
And to help orient those devices visually in the real world, robot brains are being fed YouTube videos. The idea is to help them understand the environment in which they would work and to spur physical AI.
Kate Shen, co-founder of startup Anaxi Labs, is following a different approach to training robot brains. She is crowdsourcing and supplying videos of people performing tasks, which she then shares with robotics makers.
Human-scale video, she argues, is critical to training robots because it more accurately captures how robots should perform their tasks, depending on the circumstances around them. More broadly, the technique can also provide a clearer roadmap for physical AI.
With that in mind, Computerworld spoke recently with Shen about Anaxi Labs’ physical AI initiatives and how they differ from what other companies are doing.
[Photo: Kate Shen, co-founder of startup Anaxi Labs. Credit: Anaxi Labs]
Tell me about your company and why you started it. “This is very much a … [Carnegie Mellon University] startup. We started this company [when] we realized that when it comes to AI-building [large language models] (LLMs), everybody knows that there are two things on the infra level, chips and data. The same things were happening to robotics as we moved from digital to physical AI.
“Except this time…, everybody is aware of [the] difficulty, everybody’s using infrastructure. But when it comes to data, we have to build the data infrastructure from scratch, because unlike LLM, the training data for robots can’t be from the internet.
“We realized that it would become a [barrier] sooner or later, and it will turn into a major, major industry. And that’s how we started the company.”
Isn’t physical AI data mostly collected from YouTube? What are you doing differently as a company? “You mentioned two approaches, one, using YouTube video, and two, using a simulation. And unfortunately, the two paths were [taken] back then because [of a] lack of better paths. The sheer volume of data needed to train physical AI far exceeds what’s available on the internet, and it needs physical interaction many, many times for each scenario [more] than can be found on YouTube.
“We realized, by talking to pretty much all the industry [players] since last year, [there is a] shift to egocentric, meaning like human-based training videos, data. We started investing heavily in building a world-scale data pipeline. We started working with industrial-dense regions … who usually have business covering multiple scenarios, for example, construction, logistics, and especially factory floors.
“And the second pipeline is, we can use [a] community model for this and tap into this worldwide [pool of] individuals, consumers who are wanting to upload videos for training purpose[s]. We’re launching, starting this summer, our data collection and annotation app.”
What exactly are you trying to collect from the videos? “The data we collect is simply exactly the task our clients want their robots to do: [an] egocentric view, basically like the two hands in the video doing exactly the same thing, sorting the packages and [having] their barcode scanned. In general, there are about 20 general steps, most commonly seen in industrial factory floor settings, and we’re doing all of them. Increasingly, we’re seeing household scenarios, like cleaning the kitchen, cleaning up the bedroom.
“In order for the models to be able to understand [the videos], the second most important thing is annotation. At the early beginning, they only wanted segmentation, captioning and contact point[s].
“But now, in order to have the robot really understand the how and the why behind the scene, they’re increasingly demanding captioning in the format of almost like the chain of [thought].
“For example, a robot sees a slipper. And then we’re going to identify this is what happened, and then you’ve got to grip harder. And that’s the result.”
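The annotation layers Shen lists (segmentation, captioning, contact points and a chain-of-thought style caption) could be pictured as a single record per video clip. The Python sketch below is only an illustration under stated assumptions: the class names, field names, units and sample values are all hypothetical and are not Anaxi Labs’ actual format. The sample reasoning step is loosely modeled on the grip-harder example above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ReasoningStep:
    """One link in a chain-of-thought style caption (hypothetical format)."""
    observation: str  # what is seen in the clip
    action: str       # what the hands do in response
    outcome: str      # what results from the action


@dataclass
class ClipAnnotation:
    """Hypothetical annotation record for one egocentric video clip."""
    clip_id: str
    task: str
    caption: str
    # Segmentation as (start_seconds, end_seconds, label); layout is assumed.
    segments: List[Tuple[float, float, str]] = field(default_factory=list)
    # Contact points as (time_seconds, x_pixel, y_pixel); layout is assumed.
    contact_points: List[Tuple[float, int, int]] = field(default_factory=list)
    reasoning: List[ReasoningStep] = field(default_factory=list)

    def reasoning_caption(self) -> str:
        """Flatten the reasoning steps into one caption string."""
        return " ".join(
            f"Saw: {s.observation}. Did: {s.action}. Result: {s.outcome}."
            for s in self.reasoning
        )


# Invented sample values, for illustration only.
example = ClipAnnotation(
    clip_id="clip-0001",
    task="sort packages and scan barcodes",
    caption="Two hands pick up a package and scan its barcode.",
    segments=[(0.0, 2.5, "pick up package")],
    contact_points=[(1.2, 640, 360)],
    reasoning=[
        ReasoningStep(
            observation="the object starts to slip",
            action="grip harder",
            outcome="the object stays in hand",
        )
    ],
)

print(example.reasoning_caption())
```

Keeping the reasoning steps as structured fields rather than free text is one possible design choice; it would let the same record yield either a plain caption or a step-by-step explanation.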
What is your assessment of physical AI, and how does it impact jobs? “One is surrounding the safety, and the second one is [the] impact on [the] job market. As compared to LLM, in the early LLM days everybody just [got] as much data as possible from the internet. But [for] physical AI, when they place the order, there is a specific category called [failure] and recovery cases, meaning what if something goes wrong, what should the robot do in each scenario. This is a huge difference from the LLM days. Definitely, all the physical AI companies realized that, and they’re building this into their model since the beginning.
“[On jobs,] right now, at least at this stage, we’re seeing mostly the upside. There are a lot of small robotic companies making a lot of money by working with the companies affected by [labor shortages]. We’re seeing those demands coming from factories who are struggling with shortage of labor, factories who have a problem hiring because their tasks are too dangerous.”