Senior Software Engineer - AI Infrastructure

2026-02-12 | Oracle | Honolulu, HI
Description:

Join OCI (Oracle Cloud Infrastructure) AI Infrastructure in revolutionizing the future of AI/ML/HPC workloads with our ultra-high-performance GPU platform. This is an exciting opportunity to shape systems that enable our customers to effortlessly scale from tens to thousands of GPUs while maintaining peak performance.

As a member of the GPU Availability and Monitoring team within our Compute Organization, you'll play a crucial role in designing and developing enhancements for GPU delivery, health monitoring, triage automation, and diagnostic services. These services are vital for running distributed workloads across large GPU networks built on technologies such as RoCE and InfiniBand.

We are looking for a talented Senior Software Engineer who is passionate about innovation and technological excellence. In this pivotal role, you will influence our technology stack while making significant contributions to cloud infrastructure and automation.

Your responsibilities will include designing, developing, troubleshooting, and debugging software across various cloud infrastructure components, including databases, applications, tools, and networks. You will apply your AI and ML expertise to help us remain at the forefront of the industry, and you will collaborate with and inspire the team to excel.

If you are eager to leverage your skills and have a lasting impact, we would love to hear from you!

Responsibilities:

  • Work autonomously in ambiguous situations while adhering to established standards and practices.

  • Design, develop, troubleshoot, and debug software for various cloud infrastructure components.

  • Actively define and refine standard practices and procedures for AI-driven software development.

  • Develop software solutions that involve designing, debugging, and enhancing applications or operating systems using AI and ML methodologies.

  • Lead critical projects such as:

      • Implementing ML algorithms for spike detection in provisioning failures to minimize operational disruptions.

      • Enhancing integrations with Kafka to facilitate near real-time actions for hardware repairs, aligning with 1-Day SLO goals through event-driven architecture and stream processing.

      • Creating an automated ticket routing system to enhance workflows and efficiency using NLP and ML.

      • Driving collaborative initiatives with cross-functional teams to utilize AI-driven insights and recommendations.

      • Using AI and ML to develop innovative tools for automating testing, simulating environments, and reproducing incidents.

  • Lead technical discussions across teams for seamless integrations and efficient problem-solving, and mentor junior engineers to foster their growth.

Technical Skills:

  • Proficient in Python, Java, and TypeScript.

  • Familiar with Agile development methodologies.

  • Experience in data management, data modeling, and data governance.

  • Knowledge of cloud infrastructures: OCI, AWS, Azure, and Google Cloud Platform.

  • Skilled in Linux and macOS operating systems.

  • Proficient in scripting languages such as Bash, Perl, and Ruby.

  • Familiar with container technologies like Docker.

  • Experience in designing and developing RESTful APIs and understanding API security.

  • Familiarity with API documentation tools like Swagger/OpenAPI.

  • Experience with AI tools, chatbots, and predictive analytics.

  • Robust background in Linux systems.

  • Knowledge of system-level architecture, data synchronization, fault tolerance, and state management.

  • General experience in enterprise storage, networking, or computing.

  • Good understanding of SQL databases and caching technologies such as Redis and Memcached.

Disclaimer:

Certain customer-facing roles in the U.S. may require compliance with applicable immunization and occupational health mandates.

Salary and Benefits:

The salary range for this role is $79,200 to $178,100 per annum, with potential bonuses and equity opportunities. Oracle provides a comprehensive benefits package, including medical, dental, and vision insurance; short-term and long-term disability; life insurance; flexible spending accounts; a 401(k) with company match; paid time off, holidays, sick leave, and parental leave; and employee stock purchase plans.

Oracle values diversity and is an equal opportunity employer, committed to creating an inclusive environment for all employees.

