Research Engineer

Technology

1 Opening

Our client is a small, funded team working at the intersection of AI models, data, and evals. Most AI work focuses on the model. This role focuses on the benchmarks, evals, and data that prove what the model can do. They build across software engineering, healthcare, financial services, and agentic AI. The problems change often; the bar stays high. Every hire shapes the company.

YOUR ROLE

This role builds the benchmarks, evals, and RL environments shipped to AI lab clients, covering tool use, agentic behaviour, and domain reasoning. It also builds the methodology that makes them defensible: capability taxonomies, rubric architecture, reward signals, and inter-annotator reliability studies. You take a benchmark from task design through expert authorship, scoring, and failure analysis, and you run models against it to find where they break and how often.

You work closely with the domain experts who author tasks, and you build the tooling, dashboards, and pipelines that turn their work into a benchmark a lab can trust and compare runs against. The role reports to the Head of Engineering (currently the Founder), and the output is used directly by AI lab clients to evaluate and improve their models.

WHAT YOU WILL DO

Benchmarking: Design and build benchmarks task by task: source artifacts, capability buckets, weighting, and negative criteria, following the company’s Gold Standard methodology, and keep the taxonomy and scoring aligned with what each AI lab client is trying to measure.
Evaluation runs: Build the pipeline that runs models against a benchmark, scores the rollouts, and turns the results into a report a lab can act on: pass@k, mean reward, variance, and the bucket-level breakdown.
Failure analysis: Read model rollouts closely to find where and how they fail, and turn recurring failure patterns into new negative criteria or rubric refinements.
Rubrics and IRR: Write and refine task rubrics with domain experts, build the LLM-judge grading pipeline against them, and run the inter-annotator reliability and calibration sessions that keep scores defensible.
RL environments and reward signals: Design RL environments and reward signals for post-training, and build the reward verification checks that catch reward hacking and misaligned reward functions before a client sees them.
Authoring funnel: Track the expert authoring funnel and data quality end to end (applications, work samples, accepted tasks, IRR scores), and use that data to decide where the annotation pipeline needs to change.

WHAT WE ARE LOOKING FOR

Applied experience in model evaluation, benchmarking, or post-training, with strong software or ML engineering fundamentals.
Strong coding skills and comfort working with ML models and writing evaluation code.
Solid backend fundamentals: APIs, databases, and cloud infrastructure for running and storing evaluation results at scale.
Judgment to read evaluation results and model behaviour and draw the right conclusions about what they mean.
Comfort designing rubrics and scoring criteria that hold up when an AI lab pushes back on them.
High ownership, strong execution, and comfort with ambiguity in a high-iteration environment.

NICE TO HAVE

Experience on a team that builds evals, benchmarks, or post-training data for AI labs.
Experience designing synthetic scenarios or RL environments used for reward shaping.
A research background in evaluation or benchmarking, including published work.
Public work (eval frameworks, benchmark suites, or write-ups) that shows how you think about evaluation.

WHY JOIN

Top-of-market compensation for the right candidate; they pay for depth, not years on a CV.
Your work directly shapes how frontier AI models are evaluated and improved.
Early access to the techniques and benchmarks behind the most advanced models in the world.
High ownership from day one: no ticket queues, no layers, direct impact.
A culture that rewards engineering and research depth and curiosity over credentials.

Recruitment Notice

“Due to high interest, our team connects only with candidates whose profiles closely match the role mandate.”
