#1889
Senior ML Infra Engineer
Shanghai
United States
Engineering
English
Key Responsibilities:
- Build the compute platform and machine learning libraries for large-scale machine learning and simulation workloads
- Focus on compute platform stability and efficiency on both CPU and GPU clusters, making the platform observable and scalable
- Use cluster monitoring and profiling tools to identify bottlenecks and optimize both infrastructure and software systems
- Troubleshoot and resolve issues related to OS, storage, network, and GPUs

Challenges You Will Tackle:
- Design, build, and improve our compute platform for model training and simulations on PB-scale data, covering a wide range of machine learning models, by leveraging our existing research infrastructure

Requirements:
- Solid experience running production machine learning infrastructure at large scale
- Experience designing, deploying, profiling, and troubleshooting in Linux-based computing environments
- Proficiency in containerization, parallel computing, and distributed training algorithms
- Experience with storage solutions for large-scale, cluster-based, data-intensive workloads

Bonus Qualification:
- Experience supporting machine learning researchers or data scientists with production workloads
Contact Our Consultant
Zoy Wang
Consultant