Data Engineer
- Tokyo
- Partial Remote
- Full-time
- December 16, 2025
Description
About AIRoA
The AI Robot Association (AIRoA) is launching a groundbreaking initiative: collecting one million hours of humanoid robot operation data with hundreds of robots, and leveraging it to train the world’s most powerful Vision-Language-Action (VLA) models.
What makes AIRoA unique is not only the unprecedented scale of real-world data and humanoid platforms, but also our commitment to making everything open and accessible. We are building a shared “robot data ecosystem” where datasets, trained models, and benchmarks are available to everyone. Researchers around the world will be able to evaluate their models on standardized humanoid robots through our open evaluation platform.
For researchers, this means an opportunity to:
- Work on fundamental challenges in robotics and AI: multimodal learning, tactile-rich manipulation, sim-to-real transfer, and large-scale benchmarking.
- Access state-of-the-art infrastructure: hundreds of humanoid robots, GPU clusters, high-fidelity simulators, and a global-scale evaluation pipeline.
- Collaborate with leading experts across academia and industry, and publish results that will shape the next decade of robotics.
- Contribute to an initiative that will redefine the future of embodied AI—with all results made open to the world.
Key Responsibilities
You will play a critical role in building the data backbone powering next-generation robotics foundation models:
- Design and implement large-scale data pipelines that cover the full lifecycle of high-quality datasets for robotics foundation models—collection, processing, curation, and publishing.
- Design, build, and maintain data schemas, storage solutions, and query interfaces to enable VLA researchers to efficiently discover, query, and consume curated datasets.
- Collaborate closely with VLA researchers to capture evolving data requirements and continuously improve data pipelines through analysis and experimentation.
- Design and scale distributed data-processing pipelines capable of handling petabyte-scale multimodal datasets (e.g., RGB/Depth, point clouds) with full lineage and reproducibility.
- Define data-quality metrics and build feedback loops to continuously monitor and improve data quality.
Requirements
Required Qualifications
【1. Academic & Professional】
- Master’s degree in Computer Science, Engineering, or related field (or equivalent practical experience).
- 5+ years of professional experience in data engineering / data platform development.
- Proven record of delivering production-grade, distributed data systems.
【2. ETL / Distributed Data Processing】
- 3+ years designing and operating large-scale ETL / ELT pipelines using Spark, Flink, Ray, or a similar distributed engine.
- Hands-on experience using orchestration tools and designing pipelines (Airflow, Kedro, Dagster).
- Proven optimization of workloads (10TB+/day scale).
【3. Lakehouse / Storage Architecture】
- Designed or led implementations using Delta Lake, Apache Iceberg, or Hudi.
- Integrated with Trino, Athena, Databricks SQL, or Glue/Unity Catalog.
- Defined schema evolution, ACID compliance, partitioning strategy, time travel, and cost-performance optimization.
- Managed metadata, lineage, and catalog governance.
- Equivalent experience (e.g., BigQuery-based warehouse with versioned schema management) will also be recognized.
【4. Data Modeling / Quality / Governance】
- Built bronze/silver/gold data layer structures with dbt or equivalent.
- Defined and enforced data quality SLAs (freshness, completeness, accuracy).
- Experience with Great Expectations, DataHub, OpenMetadata, or Monte Carlo.
- Implemented schema versioning, audit logging, and lineage tracking.
- Designed and owned data access control and catalog taxonomy.
【5. Domain Understanding & Business Value】
- Collaborated with product / analytics / AI teams to align platform design with business KPIs.
- Quantified platform impact (e.g., ↓30% compute cost, ↑3× query performance).
- Can explain how architecture decisions drive measurable business outcomes.
Preferred Qualifications
- Experience working with terabyte or petabyte-scale datasets.
- Expertise in data lake storage systems such as Apache Iceberg or Delta Lake with query systems such as Trino and catalog systems such as Nessie.
- Expertise in distributed processing frameworks like Spark, Flink, or Ray.
- Expertise in workflow tools such as Airflow, Kedro, or Dagster.
- Experience in analyzing, monitoring, and managing data quality.
Others (linguistic qualification, etc.)
【Highly appreciated】 English proficiency at business level; Japanese proficiency a plus.
Benefits
There are currently no comparable projects in the world that collect data and develop foundation models on such a large scale. This is one of Japan’s leading national projects, supported by a substantial investment of 20.5 billion yen from NEDO.
This position will play a crucial role in determining the success of the project. You will have broad discretion and responsibility, and we are confident that, if successful, you will gain both a great sense of achievement and the opportunity to make a meaningful contribution to society.
Furthermore, we strongly encourage engineers to actively build their careers through this project—for example, by publishing research papers and engaging in academic activities.
About AI Robot Association (AIRoA)
AIRoA is a non-profit, cross-industry association in Japan building an open and scalable robot data ecosystem that enables a wide range of researchers, startups, and companies to develop advanced AI robots together.
AIRoA collects and integrates large-scale motion, perception, and interaction data from robots operating in factories, warehouses, homes, hospitals, and construction sites, and transforms it into shared robot foundation models and datasets. By providing this common infrastructure, AIRoA accelerates AI robot development in and from Japan and supports the real-world deployment of robots that help tackle social challenges such as labor shortages, aging populations, and the need for safer and more efficient workplaces.




