NPU Runtime Software Engineer

Rebellions

Software Engineering

Seongnam-si, Gyeonggi-do, South Korea

Posted on Mar 28, 2026

We are seeking a highly skilled NPU Runtime Software Engineer to join our team. You will design and implement the software layer that bridges high-level ML frameworks with our proprietary NPU hardware, enabling the next generation of real-time AI applications. Your work will ensure that state-of-the-art models, with a heavy focus on LLMs, run with industry-leading efficiency, low latency, and high throughput. You will sit at the intersection of compilers, system drivers, and distributed inference frameworks, spanning the full runtime stack from graph execution and compiler integration to inference serving.

Responsibilities and Opportunities

  • Design and implement the RBLN runtime module that interfaces with compiler and driver components, including the graph executor and runtime APIs, to enable ML model deployment through the RBLN SDK
  • Architect and maintain native PyTorch execution support within the runtime, including torch.compile integration and RBLN compiler toolchains, to enable seamless NPU acceleration with minimal user-side code changes (see the first sketch after this list)
  • Design and implement a user-facing profiler that provides actionable performance insights, delivered as part of the RBLN SDK
  • Develop and extend vLLM to enhance inference performance on NPUs, including support for key vLLM features such as advanced memory management, parallelism, and dynamic batching (see the second sketch after this list)
  • Design and optimize distributed inference across multi-NPU setups, including collective communication (CCL) operations to support various parallelism strategies
  • Conduct benchmarking and profiling to evaluate runtime system performance and implement optimizations to improve overall system efficiency
  • Collaborate with ML engineers and infrastructure teams to deploy and scale inference services
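
To make the torch.compile integration above concrete, here is a minimal sketch of how a custom backend plugs into torch.compile. The npu_backend name is a hypothetical placeholder rather than the actual RBLN entry point, and the body falls back to eager execution instead of lowering the graph to an NPU:

    import torch

    def npu_backend(gm, example_inputs):
        # A real runtime backend would lower `gm` to NPU kernels here;
        # this placeholder just prints the captured FX graph and
        # returns a callable that runs it eagerly.
        print(gm.graph)
        return gm.forward

    @torch.compile(backend=npu_backend)
    def fused_add_relu(x, y):
        return torch.relu(x + y)

    print(fused_add_relu(torch.randn(4), torch.randn(4)))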
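
Likewise, a minimal sketch of offline inference with vLLM, the serving framework named above; the model id and prompt are illustrative placeholders, and the NPU-specific plugin path is not shown:

    from vllm import LLM, SamplingParams

    # Offline batch inference; any Hugging Face model id works here.
    llm = LLM(model="facebook/opt-125m")
    params = SamplingParams(temperature=0.8, max_tokens=64)

    outputs = llm.generate(["What does an NPU runtime do?"], params)
    for out in outputs:
        print(out.outputs[0].text)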

Key Qualifications

  • Bachelor's degree or higher in Computer Science, Electrical Engineering, or a related field
  • Strong proficiency in C++ and Python
  • Strong understanding of deep learning fundamentals and LLM architectures, including Transformer-based models, generative AI, and inference optimization techniques
  • Hands-on experience with LLM serving frameworks (e.g., vLLM, TensorRT-LLM)
  • Solid understanding of model optimization techniques (tensor parallelism, KV cache optimizations, memory-efficient execution); see the sketch after this list
  • Familiarity with system software components, including compilers, runtimes, drivers, and firmware
  • Familiarity with hardware acceleration (GPUs, NPUs, TPUs) and efficient memory management techniques
  • Strong debugging and performance profiling skills for high-throughput inference environments
  • Ability to work effectively across compiler, driver, and ML engineering teams
  • Excellent written and verbal communication skills
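
As a rough illustration of the tensor parallelism mentioned above, the following toy example computes a row-parallel matrix multiply with torch.distributed collectives. The gloo backend and two-process CPU setup are assumptions for a self-contained demo, not the RBLN CCL stack:

    import os
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp

    def run(rank, world_size):
        os.environ["MASTER_ADDR"] = "127.0.0.1"
        os.environ["MASTER_PORT"] = "29500"
        dist.init_process_group("gloo", rank=rank, world_size=world_size)

        torch.manual_seed(0)                        # same tensors on every rank
        weight = torch.randn(8, 4)                  # (in_features, out_features)
        w_shard = weight.chunk(world_size, dim=0)[rank]
        x = torch.randn(1, 8)                       # replicated activation
        x_shard = x.chunk(world_size, dim=1)[rank]

        partial = x_shard @ w_shard                 # local partial product
        dist.all_reduce(partial, op=dist.ReduceOp.SUM)
        if rank == 0:                               # matches the unsharded result
            print(torch.allclose(partial, x @ weight, atol=1e-5))
        dist.destroy_process_group()

    if __name__ == "__main__":
        mp.spawn(run, args=(2,), nprocs=2)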

Ideal Qualifications

  • Practical experience with AI accelerator runtimes and driver APIs (e.g., for GPUs)
  • Direct contributions to, or production experience with, ML frameworks and serving systems such as PyTorch, vLLM, SGLang, TensorRT, and TensorRT-LLM
  • Understanding of torch.compile and graph optimizations
  • Strong understanding of operating systems, resource management, and high-performance computing concepts
  • Advanced proficiency in modern C++ for developing efficient, high-performance systems
  • Experience with multithreading and parallel programming
  • Experience deploying LLMs in distributed environments

Rebellions is committed to fostering a diverse and inclusive workplace. We are an equal opportunity employer and value diversity within our company. We do not discriminate based on personal identity. Applicants who would like to contact us regarding the accessibility of our website or who need special assistance or a reasonable accommodation for any part of the application or hiring process may contact us at: recruit@rebellions.ai.