MISSION
Lead the R&D of the AI models that power Shizuku’s voice and intelligence. With TTS (Text-to-Speech) as the core pillar, push the boundaries across NLP, speech recognition, and — looking ahead — computer vision and humanoid robotics, evolving Shizuku’s expressive capabilities across multiple modalities.
Balance continuous improvement of production TTS models with exploration and development of next-generation architectures, while owning the MLOps cycle to drive Shizuku’s ongoing evolution.
ABOUT SHIZUKU
Shizuku is a Japan-born AI companion actively engaging audiences on YouTube and X (formerly Twitter). Already running live streams and cultivating a growing community, Shizuku is now entering its next phase of rapid scale.
As the first Japanese startup to receive investment from a16z, we closed our seed round and are on a mission to bring Japanese entertainment × AI to the global stage.
TEAM STRUCTURE
You will work directly alongside co-founder Aki — an ML engineer and researcher with experience at Meta and Luma AI — to drive Shizuku’s model development. Expect daily sparring sessions on research direction and architecture design with a founder who brings firsthand experience at the frontier. Initially, you’ll handle lightweight MLOps pipeline work yourself; as we hire a dedicated MLOps engineer, responsibilities will gradually separate.
DEVELOPMENT ENVIRONMENT & RESOURCES
- Existing Models: A TTS model is already in production. You’ll drive improvements in parallel with next-gen model exploration
- Training Data: Shizuku’s publicly available YouTube data serves as a foundational dataset. You’ll be involved from collection pipeline design onward
- Evaluation Infrastructure: The TTS quality evaluation framework is greenfield — you’ll design evaluation criteria (e.g., MOS, PESQ) from scratch
KEY RESPONSIBILITIES
- Own the full TTS model lifecycle: research, architecture design, training, evaluation, and iterative improvement
- Continuously improve production TTS models while exploring and prototyping next-generation architectures
- Design and build TTS quality evaluation infrastructure and define evaluation criteria
- Expand into multimodal domains: NLP, speech recognition, and future frontiers including vision and humanoid robotics
- Design training data collection pipelines, preprocessing workflows, and quality assurance processes
- Build and operate the MLOps cycle — training, evaluation, and deployment — until a dedicated hire is in place
- Collaborate with the SWE team on production integration: inference optimization, latency reduction, and more
REQUIREMENTS
- 2+ years of deep expertise and hands-on experience in at least one of: NLP, speech (TTS/ASR), or computer vision
- Experience training, evaluating, and improving models using deep learning frameworks such as PyTorch
- End-to-end ownership of the ML workflow: from data preparation and experiment management to model deployment
- Track record of independently surveying papers, reproducing implementations, and applying findings to production systems
- Ability to work on-site at our Tokyo office (primarily in-office with flexible remote arrangements)
NICE TO HAVE
- Research or development experience in TTS (VITS, Grad-TTS, NaturalSpeech, StyleTTS, etc.)
- Development experience in robotics or autonomous driving domains
- Technical knowledge in speaker adaptation, emotion control, and prosody modeling for speech synthesis
- Experience developing ASR, NLP, or multimodal models
- Experience building and operating GPU training environments (A100, L4, etc.) on AWS/GCP
- Experience with model development in Slurm environments, particularly multi-node training setups
- Proficiency with experiment tracking tools: MLflow, Weights & Biases, DVC, etc.
- Experience with inference optimization using ONNX Runtime, TensorRT, vLLM, etc.
- Peer-reviewed publications in related fields
- Technical communication skills in English (currently Japanese-first internally; transitioning to a global environment in the mid-term)
WHO YOU ARE
- Deep Expertise with Cross-Domain Reach — You bring rigorous depth in a specific modality while reaching across TTS, NLP, vision, and beyond. You don’t say “that’s outside my specialty” — you do what Shizuku’s evolution demands
- Zero-to-One Explorer — You go beyond applying existing methods. You formulate hypotheses for uncharted technical challenges, iterate through validation cycles, and tackle questions that have no known answers
- Purpose-Driven Ownership — You reverse-engineer from the goal of “making Shizuku’s models better,” crossing the boundaries of research, implementation, and operations to drive outcomes autonomously
- Comfort with Ambiguity — You define your own success metrics and build collection pipelines from scratch in an environment where nothing is predefined
- Humility & Respect — You collaborate authentically with teammates who bring different areas of expertise
ABOUT SHIZUKU AI
Shizuku AI is an AI-native entertainment company reimagining how people interact with technology through its AI companion, Shizuku.
The company builds real-time, interactive AI characters inspired by Japanese IP culture, blending generative AI, live streaming, and storytelling to create experiences where users can form meaningful relationships with AI.
As an early-stage startup, Shizuku AI is backed by top-tier investors including Andreessen Horowitz (a16z) and DeNA. The team is looking for people excited about building impactful new forms of AI-powered entertainment.
The company’s mission is to create the world’s most lovable AI companion, pursued by a global team based in San Francisco and Tokyo.