Key Responsibilities:
- Build the compute platform and machine learning libraries for large-scale machine learning and simulation workloads
- Improve compute platform stability and efficiency on both CPU and GPU clusters, and make the platform observable and scalable
- Use cluster monitoring and profiling tools to identify bottlenecks and optimize both infrastructure and software systems
- Troubleshoot and resolve issues related to OS, storage, network, and GPUs
Challenges You Will Tackle:
- Design, build, and improve our compute platform for model training and simulation on petabyte-scale data, across a wide range of machine learning models, by leveraging our existing research infrastructure
Requirements:
- Solid experience running production machine learning infrastructure at large scale
- Experience in designing, deploying, profiling and troubleshooting in Linux-based computing environments
- Proficiency in containerization, parallel computing and distributed training algorithms
- Experience with storage solutions for large-scale, cluster-based, data-intensive workloads
Bonus qualification:
- Experience supporting machine learning researchers or data scientists with production workloads