Tokyo

JAPAN AI - AI Quality Scientist

Tokyo
Partial Remote
Full-time
March 9, 2026

About JAPAN AI

JAPAN AI, Inc. was established in April 2023 as a group company of Geniee, Inc. (TSE Growth Market) with the mission of dramatically expanding human potential through AI technology. We drive cutting-edge AI R&D both domestically and internationally.

Why We're Hiring

The output quality of AI agents is directly tied to enterprise operations. "Sort of working" is not acceptable.

In a world where JAPAN AI STUDIO functions as "the brain of the enterprise" — autonomously executing tasks such as approval workflows, resource allocation, and prospect discovery — a wrong AI output means approvals that should have been rejected go through, incorrect staffing decisions are made, and inappropriate customers are approached. For "the brain of the enterprise" to be trusted, a system that scientifically evaluates and guarantees the accuracy, safety, and consistency of generated responses is essential.

Traditional QA engineering has centered on test case design and execution. However, quality assurance for LLM agents demands ML/DS expertise — research and development of evaluation metrics themselves, LLM-as-Judge calibration theory, reward modeling, statistical experimental design, and benchmark design.

JAPAN AI is hiring an AI Quality Scientist to establish "AI Evaluation Science" — the discipline that Apple, Anthropic, Scale AI, and Google DeepMind are pioneering — within the context of Japanese enterprise AI.

Mission

"Science the quality of AI — prove agent reliability through evaluation research and development."

Quantitatively evaluate and improve LLM / AI agent output quality using methods from machine learning, statistics, and psychometrics. Establish "AI Evaluation Science" as a new research discipline within the company — from evaluation metric R&D to production deployment of automated evaluation pipelines — and scientifically guarantee the quality of products used in production by approximately 200 companies.

Role & Expectations

As an AI Quality Scientist, you will lead both the research and implementation aspects of AI agent quality evaluation.

Research and develop evaluation metrics — scientifically define "what constitutes quality" through LLM-as-Judge calibration, reward modeling, and benchmark design
Design and build automated evaluation pipelines — integrate research outcomes into production CI/CD to deliver scalable quality gates
Red teaming and safety verification — automate adversarial testing and build policy compliance verification frameworks
Drive quality improvement through statistical experimental design — quantitatively verify the effectiveness of prompt strategies and model changes through A/B tests and significance testing
Feed evaluation signals back to research and development teams — build a compound-interest loop for model improvement
Ensure the quality of products used in production by ~200 companies through a "science of quality" approach

Why You'll Love This Role

Evaluation Science in practice: Practice "AI Evaluation Science" — the discipline that Apple, Anthropic, Scale AI, and others are investing in — within the context of Japanese enterprise AI. This is a globally rare position where evaluation methodology itself is the research subject.
A new application of ML/DS skills: Apply your machine learning and statistics expertise not to "building models" but to "evaluating models." Intellectual challenges span both research and implementation — reward modeling, LLM-as-Judge calibration theory, and benchmark design.
Quality determines product trust: In a production environment used by ~200 companies, the evaluation infrastructure you build becomes the last line of defense for release quality. You will feel the direct business impact of quality assurance.
Greenfield position: Design and build the entirely new specialized domain of AI agent evaluation science from scratch. You will have significant autonomy — from evaluation metric R&D to production deployment of automated evaluation pipelines.
Frontline of AI safety: Engage in Responsible AI practices including automated red teaming, adversarial testing, and policy compliance verification. You will play a key role in scientifically guaranteeing safety in a world where AI agents autonomously execute business operations as "the brain of the enterprise."
Rapid-growth environment: In a startup that has grown to 200+ people and 9 products in just 3 years, you will have significant autonomy in technical decision-making. You will work closely with Research Engineers and Agent Harness Engineers, influencing quality across the entire product suite.

Job Description

Evaluation Metric Research & Development
- Research and implement LLM-as-Judge calibration methods (rubric design, bias detection, proper scoring rules)
- Design, build, and validate evaluation benchmarks (construct validity, contamination detection)
- Research the application of reward modeling / preference learning to evaluation
- Select and design evaluation metrics (win rate, task success, factuality, harm detection)
- Design, build, and maintain evaluation sets (synthetic data + real logs)
Automated Evaluation Pipeline Design & Development
- Design and implement scalable automated evaluation pipelines
- Integrate evaluation pipelines into CI/CD and build quality gates
- Design agent evaluation harnesses (multi-turn, tool use, long-context support)
- Ensure reproducibility and reliability of evaluation pipelines
Safety & Quality Verification
- Research and implement automated red teaming (automated adversarial testing)
- Build safety and policy compliance verification frameworks
- Research and implement hallucination detection and calibration methods
- Design and execute prompt / tool regression tests
Statistical Analysis & Experimental Design
- Design and analyze statistical experiments (A/B tests, significance testing)
- Visualize quality trends and automate regression detection
- Create quality reports and improvement proposals
- Feed evaluation signals back to research and development teams

Key Results (KR/Metrics)

Evaluation coverage rate (test case coverage)
Regression detection rate (pre-release quality degradation detection ≥ 95%)
Evaluation pipeline execution time (completed within CI/CD)
LLM-as-Judge and human evaluation agreement rate
False positive / false negative rate
Safety incident rate (post-release)

Team Structure

Approximately 120 members are part of the development organization. The AI Quality Scientist operates as a dedicated quality assurance function, collaborating closely with:

Agentic Product Engineer — Agent feature development
Research Engineer — Research and development, model improvement
Agent Harness Engineer / Software Engineer (AI Platform) — AI execution infrastructure development
Product Manager — Product design and quality requirements definition

You May Be a Good Fit If You

Education & Experience
- Master's degree or higher (or equivalent practical experience) in Computer Science, Machine Learning, Statistics, Mathematics, Physics, Psychometrics, or related fields
- 3+ years of practical experience as an ML Engineer, Data Scientist, Research Engineer, or in ML/AI evaluation-related roles
Technical Skills
- Deep knowledge of LLM / generative AI evaluation methods (benchmark design, LLM-as-Judge, quantitative output quality measurement, hallucination detection, etc.)
- Practical knowledge of statistics and experimental design (hypothesis testing, A/B testing, confidence intervals, effect sizes, etc.)
- Experience building ML / evaluation pipelines in Python
- Practical experience with machine learning frameworks (PyTorch, JAX, TensorFlow, etc.)
- Experience designing and implementing evaluation metrics (task-specific metric design beyond precision/recall)
Language requirement: (at least one)
- Japanese: Fluent — able to discuss product development without friction
- English: Business level

Strong Candidates May Also Have

Publication experience at top ML/NLP conferences (NeurIPS, ICML, ICLR, ACL, EMNLP, etc.)
Research or implementation experience with reward modeling / preference learning (RLHF, DPO, etc.)
Experience with LLM-as-Judge calibration and rubric design
Knowledge or experience in AI safety, Responsible AI, and red teaming
Experience with benchmark design and validity verification (IRT, construct validity)
Experience evaluating multi-agent workflows, tool use, and long-context scenarios
Large-scale data processing experience (Spark / BigQuery, etc.)
Experience integrating ML / evaluation pipelines into CI/CD
Ability to read, comprehend, and reproduce research papers

Tech Stack

Languages: Python (evaluation pipelines & analysis), TypeScript / React / Next.js (frontend) / NX
Evaluation/QA: pytest, LangSmith, Weights & Biases, custom eval frameworks
Data: BigQuery, Spark, Pandas
Infrastructure: GCP (containers / K8s), Docker, Terraform
CI/CD: GitHub Actions
Tools: Slack, Confluence, Linear, Google Workspace, GitHub, Notion
AI Dev Support: Claude Code MAX Plan, Cursor, ChatGPT, Devin
Work environment: Mac (Apple Silicon), dual monitors available

Learning & Development Support

AI Tool Usage Support: Company covers the cost of using AI tools such as JAPAN AI SaaS services, Cursor, ChatGPT, Claude, etc.
Development Tool Support: If a desired development tool is paid, the cost is covered (up to ¥30,000 per year)
Book Purchase Assistance: Company covers the cost of purchasing books for learning, such as technical books (up to ¥30,000 per half-year)
Language Learning / Qualification Support: Company covers the cost of Japanese or English learning programs and qualification acquisition
Refresh Allowance: Company covers the cost of services used for personal refreshment (up to ¥5,000 per month)
Housing Allowance: Housing allowance provided for those living in designated areas (up to ¥30,000 per month)

Hiring Process

Application Review
Coding Assessment
Interviews (4–5 rounds)
Offer

A reference check will be conducted prior to the final interview.

APPLY NOW ➜Japanese Required ⚠️

About Geniee

Geniee actively utilizes AI technology in product development and is an in-house product.

With “GENIEE SFA/CRM” and “GENIEE CHAT”, users can create automatic summarization of minutes using ChatGPT. Geniee provides AI-related functions, such as automatic email creation, that help customers improve their business efficiency and productivity. Under these circumstances, they provide implementation consulting, product provision, and services related to AI technology.

In order to further promote research and development, they’ve established a new subsidiary, “JAPAN AI Co., Ltd.” in April 2023. JAPAN AI Co., Ltd. has a purpose of “passing down Japan’s traditions and using AI to increase the potential of businesses.” They develop and provide various AI products to improve the productivity of Japanese companies and revitalize the industry. In order to develop advanced products, they also conducting research in areas such as various large-scale language models such as ChatGPT and Generative AI.