The Software Side of Robotics

This episode of the TWIML AI Podcast features Nikita Rudin, co-founder and CEO of Flexion Robotics, a Switzerland-based startup working at the intersection of AI and robotics. What’s notable is that Flexion isn’t a hardware company - they’re focused on software and integration, partnering with hardware manufacturers to bring intelligence to physical robots.

The conversation provides a clear breakdown of how modern robotic AI systems are structured, and it’s one of the most informative explanations I’ve come across.

The Hierarchy of Models

Flexion’s approach uses a hierarchy of AI models, each handling a different level of abstraction:

1. Vision-Language Models (VLMs) for Task Planning

At the top level, they use off-the-shelf VLMs - models like Gemini or GPT-5 - for high-level task orchestration. When you give a robot a command like “open the fridge and get me a bottle of water,” the VLM breaks it down into granular sub-tasks: walk to the fridge, open the door, grasp the bottle, and hand it over.

This task decomposition is where general-purpose AI models excel.
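To make the planning layer concrete, here is a minimal sketch of what task decomposition with an off-the-shelf model might look like. The episode doesn't describe Flexion's actual prompts or API, so everything here is illustrative: `ask_vlm` stands in for whatever callable wraps the real model (Gemini, GPT, or otherwise), and `fake_vlm` is a canned reply for demonstration.

```python
def decompose_task(command: str, ask_vlm) -> list[str]:
    """Ask a vision-language model to split a command into sub-tasks.

    `ask_vlm` is a placeholder for the real model API: it takes a
    prompt string and returns the model's text reply.
    """
    prompt = (
        "Break the following robot command into short, ordered "
        f"sub-tasks, one per line:\n{command}"
    )
    reply = ask_vlm(prompt)
    steps = []
    for line in reply.splitlines():
        # Strip any "1." / "- " style numbering the model adds.
        line = line.strip().lstrip("0123456789.- ")
        if line:
            steps.append(line)
    return steps


# A canned stand-in for the real model, for illustration only.
def fake_vlm(prompt: str) -> str:
    return (
        "1. Walk to the fridge\n"
        "2. Open the door\n"
        "3. Grasp the bottle\n"
        "4. Hand it over"
    )


steps = decompose_task("open the fridge and get me a bottle of water", fake_vlm)
```

Each resulting step then becomes the input to the next layer down in the hierarchy.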

2. Vision-Language-Action (VLA) Models

Once you have these smaller tasks, they become inputs to a custom-trained Vision-Language-Action model. This is where Flexion’s proprietary work comes in. Given a specific task, the VLA model predicts the sequence of movements the robot must perform to complete it.

3. Low-Level Controllers

At the bottom of the hierarchy are whole-body trackers and locomotion controllers that translate high-level movement commands into actual motor signals.
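The episode doesn't specify Flexion's controller design, but the standard core of this bottom layer is a PD (proportional-derivative) control law: turn a target joint position into a motor torque. A minimal sketch, with illustrative gains:

```python
def pd_torque(q_target: float, q: float, qd: float,
              kp: float = 40.0, kd: float = 2.0) -> float:
    """Convert a target joint position into a motor torque.

    Classic PD law: push toward the target in proportion to the
    position error (kp), damped by the joint velocity (kd).
    Gains here are illustrative, not from the episode.
    """
    return kp * (q_target - q) - kd * qd


# One joint, currently at 0.0 rad and at rest, commanded to 1.0 rad:
torque = pd_torque(q_target=1.0, q=0.0, qd=0.0)
```

A whole-body tracker runs a loop like this across every joint, hundreds of times per second, while the layers above replan far less frequently.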

The Sim-to-Real Gap

Here’s where it gets challenging. The VLA models are trained using simulation data - 3D environments where robots can practice tasks millions of times without physical wear or safety concerns. But the real world is messy.

The “sim-to-real gap” refers to the differences between simulated environments and actual physical reality. Lighting varies, surfaces have unexpected textures, objects don’t behave exactly as modeled. This perception gap remains one of the core challenges in deploying trained models to physical robots.
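One widely used way to narrow this gap - a standard technique in the field, not something the episode attributes to Flexion specifically - is domain randomization: varying physics parameters across simulated episodes so the policy never overfits to one exact world. A minimal sketch with made-up parameter ranges:

```python
import random


def randomize_sim_params(rng=random):
    """Sample fresh physics parameters for each training episode.

    By training across many values of friction, payload mass, and
    actuator latency, the policy is forced to be robust to whatever
    exact values it meets in the real world. The ranges below are
    illustrative, not anyone's actual settings.
    """
    return {
        "friction": rng.uniform(0.4, 1.2),
        "payload_mass_kg": rng.uniform(0.0, 2.0),
        "actuator_latency_s": rng.uniform(0.0, 0.04),
    }


params = randomize_sim_params()
```

The trade-off is that wider ranges make policies more robust but harder to train, which is part of why simulation fidelity still matters.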

Flexion is also exploring “real-to-sim” approaches - using real-world data to refine simulation parameters and achieve higher fidelity training.
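The episode doesn't describe how Flexion implements real-to-sim, but the basic idea - system identification - can be sketched in a few lines: collect a real trajectory, then pick the simulator parameter whose predicted trajectory matches it best. All names here are illustrative.

```python
def fit_sim_parameter(real_values, simulate, candidates):
    """Pick the simulator parameter whose rollout best matches real data.

    `simulate(param)` returns a predicted trajectory for a candidate
    parameter value; we keep the candidate with the smallest squared
    error against the measured trajectory.
    """
    def error(param):
        sim = simulate(param)
        return sum((s - r) ** 2 for s, r in zip(sim, real_values))

    return min(candidates, key=error)


# Toy example: the "real" system scales time by 0.8; a linear
# simulator with the right coefficient should match it exactly.
real = [0.8 * t for t in range(5)]
best = fit_sim_parameter(real, lambda f: [f * t for t in range(5)], [0.5, 0.8, 1.0])
```

Real systems fit many parameters at once with gradient-based or Bayesian methods rather than a grid of candidates, but the loop is the same: real data in, better simulation out.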

The Hot Take: No Valuable Work Yet

One of the more provocative claims in this episode: none of the humanoid robots to date actually perform valuable work. By “valuable work,” Rudin means work done autonomously, without hindering humans, while coexisting safely in the same environment.

This feels like an insider perspective, and it rings true. What we’ve seen so far are impressive demos and a lot of hype. Progress is being made, but there’s a significant gap between choreographed demonstrations and robots that can be deployed for real productive tasks.

Hardware-Software Partnership

What struck me most was how Flexion approaches the hardware question. Rather than trying to build robots themselves, they leave that to the hardware experts. Instead, they focus on building software that integrates tightly with existing hardware platforms - forming partnerships and close collaborations with equipment manufacturers.

This division of labor makes sense. The robotics space is complex enough without trying to solve both hardware and software challenges simultaneously.

The Takeaway

Getting anything to work in the physical world is fundamentally harder than getting a model to operate a browser or generate text. The variability and randomness in real environments are immense. But the hierarchical approach - using general-purpose VLMs for planning, custom VLA models for action, and specialized controllers for execution - provides a clear framework for how the field is tackling these challenges.

We’re not there yet, but the path forward is becoming clearer.