Job Description

#1889

Senior ML Infra Engineer

Shanghai

United States

Engineering

English

Key Responsibilities:
- Build the compute platform and machine learning libraries for large-scale machine learning and simulation workloads
- Focus on compute platform stability and efficiency on both CPU and GPU clusters, making the platform observable and scalable
- Use cluster monitoring and profiling tools to identify bottlenecks and optimize both infrastructure and software systems
- Troubleshoot and resolve issues related to OS, storage, network, and GPUs

Challenges You Will Tackle:
Design, build, and improve our compute platform for PB-scale data model training and simulations with a wide range of machine learning models by leveraging our existing research infrastructure.

Requirements:
- Solid experience running production machine learning infrastructure at large scale
- Experience designing, deploying, profiling, and troubleshooting in Linux-based computing environments
- Proficiency in containerization, parallel computing, and distributed training algorithms
- Experience with storage solutions for large-scale, cluster-based, data-intensive workloads

Bonus Qualification:
- Experience supporting machine learning researchers or data scientists with production workloads

Contact Our Consultant

Zoy Wang

Consultant

Surrienta Consulting Ltd. © 2024