Job Description:

Business Overview

The Technology Platforms Division (TPD) drives the growth of the Rakuten Ecosystem by delivering innovative, high-quality technology platforms characterized by integrated control and strategic partnerships.

Within TPD, the Ecosystem Platform Supervisory Department (EPSD) develops scalable and reliable platforms that support the entire Rakuten Ecosystem globally, fostering a culture of ownership and data-driven decision-making.

Department Overview

The Incentive Platform Department (INPD) provides incentive and payment solutions that empower Rakuten's businesses around the world. Its platforms continually add capabilities and scale to accelerate synergies across the Rakuten Ecosystem.

The Point Service Section (PSS) is responsible for developing and operating Rakuten Point and Rakuten Coupon. We develop and manage Rakuten PointClub and other point-related web products, which are among the most popular sites at Rakuten. We drive continuous improvement to maximize the value of Rakuten Point and contribute to the Rakuten Ecosystem. By the time you finish reading this, hundreds of thousands of point transactions will have been processed and many users will have visited our web and app services. We are looking for someone with a strong passion for our services and products.

Position:

Position Details
As a Lead Site Reliability Engineer (Lead SRE), you will set the technical direction for the stable operation and continuous improvement of our mission-critical services. You will define and drive initiatives to enhance service reliability, scalability, and performance across the team, spanning incident response, automation, monitoring, capacity planning, and observability strategy. You will serve as the primary technical authority within the SRE team, guiding and upskilling other engineers while working closely with product development, infrastructure, and security teams.

Responsibilities
- Service Quality Definition & Achievement: Define Service Level Objectives (SLOs) and Service Level Agreements (SLAs). Lead the planning and execution of improvement activities to achieve them. Drive the adoption and operation of Error Budgets across the team.
- Performance & Latency Improvement: Identify bottlenecks in service performance and latency. Lead the team in proposing and implementing solutions, setting technical standards for performance work.
- Incident Management & Troubleshooting: Act as incident commander during production outages, leading rapid restoration efforts. Drive Root Cause Analysis (RCA) processes and the implementation of systemic preventative measures.
- Operational Efficiency & Automation: Champion the automation of operational processes to reduce toil. Architect scalable operational frameworks and establish best practices for the team.
- Technical Leadership & Mentorship: Provide technical guidance and mentorship to SRE team members. Conduct technical design reviews, define engineering standards, and contribute to the overall skill development of the team.
- Cross-functional Collaboration: Lead collaboration with product development teams, infrastructure teams, security teams, and other relevant departments. Foster a DevOps culture and drive alignment on reliability goals across the organization.
- On-call: Participate in and help shape the 24/7 on-call rotation, including refining escalation paths and runbooks.

Mandatory Qualifications:
- Bachelor's degree in Computer Science or related field, or equivalent practical experience.
- More than 5 years of hands-on experience in SRE, infrastructure engineering, or a related field, with demonstrated technical leadership experience.
- Experience building and operating production systems in public cloud (AWS, GCP, Azure, etc.) or private cloud environments.
- Extensive experience designing, building, operating, and scaling Kubernetes environments.
- Deep knowledge and hands-on experience building and operating modern monitoring, alerting, and logging tools (e.g., Prometheus, Grafana, ELK Stack, Datadog).
- In-depth knowledge of UNIX-like operating system internals and/or networking.
- Deep knowledge of IP network systems and protocols (TCP/IP, HTTP, etc.) and hands-on troubleshooting experience.
- Experience building automated workflows using CI/CD tools (e.g., Jenkins, CircleCI, GitLab CI/CD).
- Experience developing operational automation tools and scripts using scripting languages such as Shell, Python, etc.
- Proven track record of leading production incident handling end-to-end (detection, triage, short-term and long-term fixes, root cause analysis).
- Experience in system performance tuning and capacity planning.
- Proficiency with Git and GitHub for version control and collaboration.
- Strong communication, negotiation, and collaboration skills to articulate complex technical issues and align with internal and external stakeholders.

Desired Qualifications:
- Experience developing or maintaining GCP environments (e.g., GKE, Cloud Run, BigQuery, Cloud Monitoring, IAM).
- Experience in web application development.
- Deep knowledge and practical experience in observability, and a strong drive to improve services leveraging SLIs/SLOs.
- Experience implementing and operating error budgets, or a proven track record in toil reduction initiatives.
- Experience driving cross-team or org-wide reliability improvements (e.g., defining standards, leading postmortem culture).
- Experience working with cross-cultural global teams in different locations.


Languages:

English (Overall - 4 - Fluent)

Lead Site Reliability Engineer (Lead SRE) - Incentive Platform Department (INPD)
