USNLX Diversity Jobs

Job Information

Microsoft Corporation Principal AI/HPC Software Engineer in Multiple Locations, United States

Azure is building modern accelerated supercomputers at unprecedented scale to meet the massive computational demands of the world’s leading generative AI. Microsoft’s Eagle cluster, a Graphics Processing Unit (GPU)-accelerated supercomputer, is a noteworthy example, achieving the coveted #3 and #2 ranks in the Top500 and MLPerf benchmarks, respectively. The Azure Artificial Intelligence (AI) high performance computing (HPC) team is looking for a Principal AI/HPC Software Engineer to benchmark, profile, debug, and tune the generative AI applications running in the production infrastructure. Sophisticated tools and techniques are needed to maintain the reliability, runtime performance, and health of the hundreds of nodes in a supercomputer consisting of thousands of GPUs. The candidate will work closely with customers who are building the world’s leading generative AI to understand the characteristics of their workloads, profile them to find performance bottlenecks, and apply best-known, state-of-the-art, and novel tools and techniques to keep AI jobs running smoothly. As a contributing member of the core group of engineers in Azure, the candidate will also bring best practices that drive architectural changes and influence the roadmap of relevant software and hardware components. Your work will directly impact the business goals of a wide range of users and facilitate the next wave of growth and innovation in AI and HPC in the cloud.

We are looking for a Principal AI/HPC Software Engineer who cares about quality, wants customers to succeed, and gets things done. You will join a phenomenal team of engineers and researchers with deep experience in high performance computing, machine learning, deep learning, middleware, and software engineering. The following values drive us:

  • Drive for Results: We’re here to build great products. We take on whatever work is right for the product and strive for the best possible results.

  • Modesty and Adaptability: The right answer is more important than being right. We search for solutions as a team, adapt quickly, and value transparent and open feedback.

Your mission will be to help ensure the Azure platform performs consistently, scales on demand, and is engineered to withstand the unparalleled computing demand of customer workloads. You will help build a test-driven engineering culture to reduce regressions and bugs in production, and you will set a higher bar for infrastructure quality.

Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

Responsibilities

  • Identifies, tracks, and assesses features in parallel programming layers (such as CUDA or HIP C++) to improve throughput or latency on state-of-the-art GPU hardware, rack-level instruments, or datacenters; compiles and submits data, analyses, and reports.

  • Analyzes the runtime profiles or call graphs of parallel programs running synchronously across hundreds to thousands of devices (GPUs), analogous to known High Performance Computing simulation workloads (e.g., NAMD, LINPACK, SEISMIC).

  • Develops additional instrumentation in application code to log runtime characteristics if not available in standard tools.

  • Communicates with CPU or GPU architects to understand the intellectual merit, performance characteristics, and overhead or readiness of hardware features and supporting software.

  • Reproduces novel ideas and optimization techniques from published literature to accelerate generative AI training and inferencing; develops proofs of concept and measures their impact on the end-to-end runtime of critical applications.

  • Analyzes overheads and performance characteristics of critical software frameworks (e.g., PyTorch, NVIDIA CUDA, AMD HIP) in the end-to-end runtime of generative AI training and inferencing.

  • Manages, oversees, provides guidance to, and reviews the work of individual contributors and people managers to accomplish operational plans and results.

  • Embody our Culture (https://www.microsoft.com/en-us/about/corporate-values) and Values (https://careers.microsoft.com/us/en/culture)

Qualifications

Required Qualifications:

  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python

  • OR equivalent experience

  • 6+ years of experience in software design and development

  • 3+ years of experience in developing and running AI/HPC applications on clusters

Other Requirements:

  • The ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings:

  • Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.

Preferred Qualifications:

  • PhD in Computer Science, Electrical Engineering, or related areas

  • Exposure to operational challenges of running HPC systems (availability, fault tolerance) and mitigation mechanisms

  • Previous experience with running and troubleshooting machine learning workloads on GPU clusters is a plus

  • Exposure to Cloud Computing, Virtualization and Container Technologies

  • Familiarity with HPC software stack

Software Engineering IC5 - The typical base pay range for this role across the U.S. is USD $137,600 - $267,000 per year. A different range applies to specific work locations within the San Francisco Bay area and New York City metropolitan area, where the base pay range for this role is USD $180,400 - $294,000 per year.

Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here: https://careers.microsoft.com/us/en/us-corporate-pay

Microsoft will accept applications for the role until July 26, 2024.

#azurecorejobs

Microsoft is an equal opportunity employer. Consistent with applicable law, all qualified applicants will receive consideration for employment without regard to age, ancestry, citizenship, color, family or medical care leave, gender identity or expression, genetic information, immigration status, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran or military status, race, ethnicity, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable local laws, regulations and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application process, read more about requesting accommodations (https://careers.microsoft.com/v2/global/en/accessibility.html).
