NPU Runtime Software Engineer

Rebellions

Software Engineering

Seongnam-si, Gyeonggi-do, South Korea

Posted on Mar 28, 2026

We are seeking a highly skilled NPU Runtime Software Engineer to join our team. You will design and implement the software layer that bridges high-level ML frameworks with our proprietary NPU hardware, enabling the next generation of real-time AI applications. Your work will ensure that state-of-the-art models, with a heavy focus on LLMs, run with industry-leading efficiency, low latency, and high throughput. You will sit at the intersection of compilers, system drivers, and distributed inference frameworks, spanning the full runtime stack from graph execution and compiler integration to inference serving.

Responsibilities and Opportunities

  • Design and implement the RBLN runtime module that interfaces with compiler and driver components, including the graph executor and runtime APIs, to enable ML model deployment through the RBLN SDK
  • Architect and maintain native PyTorch execution support within the runtime, including torch.compile integration and RBLN compiler toolchains, to enable seamless NPU acceleration with minimal user-side code changes (see the first sketch after this list)
  • Design and implement a user-facing profiler that provides actionable performance insights, delivered as part of the RBLN SDK
  • Develop and extend vLLM to enhance inference performance on NPUs, including support for key vLLM features such as advanced memory management, parallelism, and dynamic batching (see the second sketch after this list)
  • Design and optimize distributed inference across multi-NPU setups, including collective communication (CCL) operations to support various parallelism strategies
  • Conduct benchmarking and profiling to evaluate runtime system performance and implement optimizations to improve overall system efficiency
  • Collaborate with ML engineers and infrastructure teams to deploy and scale inference services
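
To make the torch.compile integration above concrete, here is a minimal sketch of how a custom backend plugs into torch.compile. The npu_backend name is a hypothetical placeholder rather than the actual RBLN entry point, and the body falls back to eager execution instead of lowering the graph to an NPU:

    import torch

    def npu_backend(gm, example_inputs):
        # A real runtime backend would lower `gm` to NPU kernels here;
        # this placeholder just prints the captured FX graph and
        # returns a callable that runs it eagerly.
        print(gm.graph)
        return gm.forward

    @torch.compile(backend=npu_backend)
    def fused_add_relu(x, y):
        return torch.relu(x + y)

    print(fused_add_relu(torch.randn(4), torch.randn(4)))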
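
Likewise, a minimal sketch of offline inference with vLLM, the serving framework named above; the model id and prompt are illustrative placeholders, and the NPU-specific plugin path is not shown:

    from vllm import LLM, SamplingParams

    # Offline batch inference; any Hugging Face model id works here.
    llm = LLM(model="facebook/opt-125m")
    params = SamplingParams(temperature=0.8, max_tokens=64)

    outputs = llm.generate(["What does an NPU runtime do?"], params)
    for out in outputs:
        print(out.outputs[0].text)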

Key Qualifications

  • Bachelor's degree or higher in Computer Science, Electrical Engineering, or a related field
  • Strong proficiency in C++ and Python
  • Strong understanding of deep learning fundamentals and LLM architectures, including Transformer-based models, generative AI, and inference optimization techniques
  • Hands-on experience with LLM serving frameworks (e.g., vLLM, TensorRT-LLM)
  • Solid understanding of model optimization techniques (tensor parallelism, KV cache optimizations, memory-efficient execution); see the sketch after this list
  • Familiarity with system software components, including compilers, runtimes, drivers, and firmware
  • Familiarity with hardware acceleration (GPUs, NPUs, TPUs) and efficient memory management techniques
  • Strong debugging and performance profiling skills for high-throughput inference environments
  • Ability to work effectively across compiler, driver, and ML engineering teams
  • Excellent written and verbal communication skills
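
As a rough illustration of the tensor parallelism mentioned above, the following toy example computes a row-parallel matrix multiply with torch.distributed collectives. The gloo backend and two-process CPU setup are assumptions for a self-contained demo, not the RBLN CCL stack:

    import os
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp

    def run(rank, world_size):
        os.environ["MASTER_ADDR"] = "127.0.0.1"
        os.environ["MASTER_PORT"] = "29500"
        dist.init_process_group("gloo", rank=rank, world_size=world_size)

        torch.manual_seed(0)                        # same tensors on every rank
        weight = torch.randn(8, 4)                  # (in_features, out_features)
        w_shard = weight.chunk(world_size, dim=0)[rank]
        x = torch.randn(1, 8)                       # replicated activation
        x_shard = x.chunk(world_size, dim=1)[rank]

        partial = x_shard @ w_shard                 # local partial product
        dist.all_reduce(partial, op=dist.ReduceOp.SUM)
        if rank == 0:                               # matches the unsharded result
            print(torch.allclose(partial, x @ weight, atol=1e-5))
        dist.destroy_process_group()

    if __name__ == "__main__":
        mp.spawn(run, args=(2,), nprocs=2)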

Ideal Qualifications

  • Practical experience with AI accelerator runtimes and driver APIs (e.g., for GPUs)
  • Direct contributions to, or production experience with, ML frameworks and serving systems such as PyTorch, vLLM, SGLang, TensorRT, and TensorRT-LLM
  • Understanding of torch.compile and graph optimizations
  • Strong understanding of operating systems, resource management, and high-performance computing concepts
  • Advanced proficiency in modern C++ for developing efficient, high-performance systems
  • Experience with multithreading and parallel programming
  • Experience deploying LLMs in distributed environments

Rebellions is committed to fostering a diverse and inclusive workplace. We are an equal opportunity employer and value diversity within our company. We do not discriminate based on personal identity. Applicants who would like to contact us regarding the accessibility of our website or who need special assistance or a reasonable accommodation for any part of the application or hiring process may contact us at: recruit@rebellions.ai.