Key Responsibilities
• Ensure reliability, availability, and performance of production trading and research systems through proactive monitoring and incident management.
• Manage monitoring, alerting, and troubleshooting in production environments.
• Lead incident response, perform root cause analysis (RCA), and run postmortems to continuously improve system stability.
• Build automation to streamline deployment, scaling, and operational workflows, reducing manual intervention.
• Optimize system performance and scalability across distributed, data-intensive environments.
• Manage and enhance CI/CD pipelines and infrastructure, including container and orchestration platforms (e.g., Kubernetes).
• Collaborate with engineering, quant, and trading teams to improve system reliability and operational efficiency.
Required Qualifications
• Bachelor’s degree or higher in Computer Science, Engineering, or a related field.
• 3+ years of experience in production engineering, site reliability engineering, or backend systems engineering.
• Strong programming skills in Python.
• Solid understanding of Linux systems, networking, and distributed systems.
• Experience with monitoring and observability tools (e.g., Prometheus, Grafana, ELK stack).
• Understanding of the FIX protocol or trading workflows.
Preferred Qualifications
• Experience in financial services, trading systems, or low-latency environments.
• Familiarity with market data systems and electronic trading infrastructure.
• Experience with Kubernetes, Docker, or cloud infrastructure.
• Exposure to high-performance or distributed computing environments.