HPC Infrastructure DevOps Engineer II
St. Jude Children's Research Hospital
Software Engineering, Other Engineering
Memphis, TN, USA
USD 86,320-154,960 / year
The World’s Most Dedicated Never Give Up
There’s a reason St. Jude Children’s Research Hospital consistently earns a Glassdoor Employees’ Choice Award and is named to its "Best Places to Work" list: at our world-class pediatric research hospital, every one of our professionals shares our commitment to making a difference in the lives of the children we serve. There’s a unique bond when you’re part of a team that gives its all to advance treatments and cures for pediatric catastrophic diseases. The result is a collaborative, positive environment where everyone, regardless of their role, receives the resources, support, and encouragement to advance and grow their careers and be the force behind the cures.
St. Jude is where those with a passion for making a difference come to break new ground! Located in Memphis, Tennessee, the mission of St. Jude Children’s Research Hospital is to advance cures, and means of prevention, for pediatric catastrophic diseases through research and treatment. We are leading the way the world understands, treats, and defeats childhood cancer and other life-threatening diseases.
Position Overview
St. Jude is seeking an HPC Infrastructure DevOps Engineer II to join the High-Performance Computing Support (HPCS) team. This role is responsible for the smooth operation, automation, and continuous improvement of St. Jude’s high-performance computing environment, with a focus on HPC operations, DevOps practices, and automation for configuration, testing, monitoring, and autonomous remediation. The position supports a modern research computing ecosystem spanning on-premises and remote-site infrastructure, including:
• HPC compute platforms for research and data-intensive workloads
• GPU-enabled environments for AI and machine learning applications
• High-capacity research, compliant, and scratch storage tiers
• Archival, backup, and disaster recovery services
• Operational tooling for observability, governance, and process automation
Working closely with infrastructure, storage, security, and research teams, the HPC Infrastructure DevOps Engineer II will deliver reliable and scalable services for computational science, regulated workflows, and AI-enabled research. This role is central to the HPCS service portfolio, including daily HPC client request fulfillment, performance and utilization monitoring, data management and governance, data cataloguing and archival services, and DevOps-driven HPC process automation.
Job Responsibilities
HPC Infrastructure Operations
- Support the day-to-day operation of St. Jude’s HPC infrastructure across compute and storage platforms.
- Maintain a stable, secure, and scalable environment for research computing and data-intensive scientific workflows.
- Work with downstream operational teams to ensure systems are configured, validated, monitored, patched, and maintained effectively.
- Participate in infrastructure testing, upgrade activities, service transitions, and operational readiness efforts.
- Contribute to the reliability and supportability of hybrid HPC environments spanning primary and remote-site services.
Daily HPC Client Request Fulfillment
- Respond to daily user requests involving HPC access, Linux environment support, storage allocation, software availability, job troubleshooting, and data movement.
- Provide timely and effective support to researchers, analysts, and technical staff using HPC and AI-enabled research resources.
- Resolve service incidents and user issues through structured troubleshooting and escalation as needed.
- Maintain service-oriented communication with users and stakeholders to support a high-quality support experience.
Performance and Utilization Monitoring
- Implement and improve monitoring for compute nodes, GPU resources, scheduler activity, storage systems, backup operations, and platform health.
- Track usage trends, availability, capacity consumption, and operational KPIs to support efficient service delivery.
- Analyze utilization patterns and recommend improvements to throughput, performance tuning, scheduling efficiency, and user experience.
- Build and maintain dashboards, metrics collection workflows, health checks, and alerting mechanisms to support proactive operations and continuous process improvement.
- Support governance reporting and visibility into service consumption and infrastructure health.
Data Management and Governance
- Support operational controls for research and compliant data across active storage, protected environments, backup systems, and archival tiers.
- Implement and maintain standards for data handling, retention, access control, traceability, and lifecycle operations.
- Contribute to governance tracking and reporting for HPC-supported data services.
- Assist with data movement and retention workflows across high-performance, compliant, backup, and archival storage platforms.
Data Cataloguing and Archival Services
- Support data intake, metadata-aware cataloguing, archival placement, recall, restore validation, and tier-to-tier data movement.
- Assist with workflows involving archival platforms, cold storage, backup systems, and long-term retention services.
- Improve discoverability and lifecycle management of research datasets through automation and procedural standardization.
- Support operational validation of archival and recovery workflows for critical data services.
HPC Process Automation DevOps
- Use automation tooling to handle system configuration, provisioning, platform maintenance, testing, and operational workflows.
- Enable DevOps lifecycle functions by supporting tooling and processes for development, testing, release, and operational support.
- Build and maintain CI/CD pipelines and repeatable infrastructure workflows to improve reliability, consistency, and deployment speed.
- Reduce manual effort by developing scripts, integrations, and self-service mechanisms for recurring HPCS operational tasks.
- Apply automation and generative AI tools responsibly to improve scripting, documentation, incident analysis, and support efficiency.
AI and Accelerated Computing Support
- Assist with deployment and support of AI software stacks, containerized research environments, and Python-based computational workflows.
- Help optimize GPU utilization, data throughput, storage access patterns, and job execution for AI training and inference use cases.
- Support reproducible environments for AI applications through dependency management and platform validation.
- Maintain GPU software stacks, containerized runtimes, dependency consistency, and high-throughput data access for distributed AI training and inference.
- Contribute to operational support for research teams using container runtimes, distributed job workflows, and accelerator-aware scheduling.
Security, Risk, and Incident Response
- Identify and deploy security measures through vulnerability assessment, configuration review, patch coordination, and risk-aware operational practices.
- Participate in incident response, troubleshooting, and root cause analysis for infrastructure and service disruptions.
- Support backup validation, restore readiness, and disaster recovery operational practices for critical HPC services.
- Follow institutional requirements for secure handling of research and regulated data.
Collaboration and Continuous Improvement
- Work with end users and partner teams to understand operational needs, user requirements, and service KPIs.
- Coordinate across technical teams to improve service quality, communication, and execution.
- Contribute to ongoing process improvement initiatives that reduce lead time, strengthen platform reliability, and improve user experience.
- Maintain accurate technical documentation for systems, configurations, procedures, and knowledge articles.
- Perform other duties as assigned to support the goals and objectives of the department and institution.
Minimum Education
- Bachelor's degree in Computer Science, Engineering, Business, or a related field of study required.
- Master's degree preferred.
Minimum Experience
- Minimum requirement: Two (2) years of IT experience, including work in infrastructure operations and engineering environments.
- Some experience in infrastructure design, systems analysis, and security management.
- Some experience working with business stakeholders to identify and document requirements.
- Proven performance in a prior or comparable role.
Preferred Qualifications
Infrastructure & Operations
- Experience supporting Linux-based enterprise or research computing environments.
- Experience with scripting or automation using Python, Bash, or similar languages.
- Familiarity with DevOps and infrastructure-as-code practices using tools such as Ansible, Terraform, Git-based workflows, or CI/CD platforms.
- Experience with observability platforms for logs, metrics, dashboards, and alerting.
- Familiarity with HPC workload schedulers such as IBM LSF, Slurm, or comparable systems.
- Experience supporting high-performance storage, backup, and archival services.
- Familiarity with containers and reproducible compute environments such as Singularity, Docker, or related platforms.
- Understanding of secure multi-user platform operations in research or regulated environments.
AI / ML Environment
- Experience supporting GPU-based systems for AI and machine learning workloads.
- Familiarity with CUDA-enabled environments, GPU monitoring, and NVIDIA software stack dependencies (CUDA, cuDNN, NCCL).
- Experience with AI/ML ecosystems such as PyTorch, TensorFlow, Jupyter, and distributed training workflows.
- Familiarity with distributed and multi-GPU training frameworks such as PyTorch Distributed, DeepSpeed, Horovod, or Ray.
- Understanding of data pipeline, storage throughput, checkpointing, and large dataset staging requirements for model training and inference.
- Familiarity with operational practices adjacent to MLOps, including experiment tracking, artifact handling, workflow automation, and workload observability.
- Understanding of secure and compliant support for AI workloads operating on sensitive research data.
- Ability to troubleshoot AI application issues across infrastructure, scheduler, storage, container, and accelerator layers.
Generative AI Productivity
- Familiarity with generative AI tools that improve productivity in infrastructure operations, DevOps, and technical support workflows.
- Ability to use generative AI assistants to accelerate scripting, automation, troubleshooting, and documentation tasks.
- Familiarity with prompt design and iterative prompting techniques for script development, log analysis, workflow generation, and systems diagnostics.
- Understanding of the limitations, risks, and verification requirements of generative AI outputs, including accuracy validation, security awareness, and protection of sensitive or regulated data.
- Ability to identify practical use cases for generative AI that reduce manual effort across HPC support, governance tracking, and process automation.
Core Competencies
- Strong troubleshooting and analytical skills.
- Service-oriented mindset with excellent communication and follow-through.
- Ability to work effectively across infrastructure, operations, and user-facing support functions.
- Commitment to documentation, process discipline, and continuous improvement.
- Comfort operating in a mission-driven research and regulated data environment.
- Ability to balance technical depth with responsiveness in a fast-paced, high-stakes setting.
Compensation
In recognition of certain U.S. state and municipal pay transparency laws, St. Jude is including a reasonable estimate of the compensation range for this role. This is an estimate offered in good faith, and a specific salary offer takes into account factors that are considered in making compensation decisions, including but not limited to skill sets, experience and training, licensure and certifications, and other business and organizational needs. It is not typical for an individual to be hired at or near the top of the salary range, and compensation decisions are dependent on the facts and circumstances of each case. A reasonable estimate of the current salary range is $86,320 - $154,960 per year for the role of HPC Infrastructure DevOps Engineer II.
Explore our exceptional benefits!
No Search Firms
St. Jude Children's Research Hospital does not accept unsolicited assistance from search firms for employment opportunities. Please do not call or email. All resumes submitted by search firms to any employee or other representative at St. Jude via email, the internet or in any form and/or method without a valid written search agreement in place and approved by HR will result in no fee being paid in the event the candidate is hired by St. Jude.