TITLE

 

Site Reliability Manager

COMPANY

 

Qubole

LOCATION

 

Bangalore, IN

Description

Qubole, the leading cloud-agnostic, big data-as-a-service provider, is passionate about making data-driven insights easily accessible to anyone. Qubole delivers the industry’s first autonomous data platform. The cloud-based data platform, Qubole Data Service (QDS), removes the burden of maintaining infrastructure of multiple big data processing engines, and enables customers to focus on their data. Qubole customers process nearly an exabyte of data every month. Qubole investors include Charles River, Institutional Venture Partners, Lightspeed, Norwest, Harmony and Singtel Innov8. 
 
WHAT YOU WILL BE DOING
 
Project Management: Effectively manage and deliver multiple initiatives in the Qubole operations team with high quality and on schedule. Work closely with engineering and product leadership to prioritize these initiatives to deliver the greatest value to the business.
 
Technical Leadership: Provide technical and process guidance on operational best practices for 24x7x365 availability and reliability across multiple Qubole production environments. Help in creating standardized processes for SRE/Site Operations engineers to monitor, alert, debug, and resolve production incidents. Lead the team’s response to production incidents and coordinate with engineering on resolving issues. When required, get to the code/infrastructure level to guide the team or resolve operational issues.
 
Incident Management: Drive the charter for incident management with an effective feedback loop for both engineering and customers. Work with the Qubole Security team to establish and document security controls and procedures. Develop and improve tools to automate the deployment, administration, and monitoring of a large-scale web service on AWS, Azure, and GCP.
 
People Management: Build up an SRE/SiteOps team to provide 24x7 coverage for all Qubole production environments. Coach and mentor operations engineers so they gain valuable technical skills and grow in their careers. Nurture a great place to work and build an environment that fosters engagement, innovation, and positive energy.
 
Metrics Driven: Hands-on experience in driving better metrics and monitoring initiatives to gain insights into the production environments.

MUST HAVE

    • Strong experience in creating standardized and streamlined operational processes/tools for operating large and complex distributed systems.
    • Excellent prioritisation and time management abilities, with a focus on execution.
    • Knowledge of public clouds such as AWS, Azure, or Google Cloud, and experience working with configuration and deployment management tools like Chef or Puppet.
    • Strong system administration background for Linux based systems.
    • Good verbal/written communication skills and leadership experience: managing multiple initiatives, leading meetings, handling incident responses, presenting to diverse audiences, and building high-performing teams.
    • Strong problem solving skills – both in the technology and leadership/influence areas.
    • Operational expertise in deploying and managing components like MySQL, Nginx, Elasticsearch, Java applications, RoR, load balancers, Grafana, etc.
    • Comfortable with networking fundamentals like firewalls, subnetting, route tables, etc.
    • Exposure to scripting languages like Bash, Python or Ruby is desired.
    • Operating experience with Kubernetes/Containers.
    • Degree in Computer Science/Engineering.

NICE TO HAVE

    • Experience building/managing large-scale distributed web-based systems.
    • Experience operating big data clusters like Hadoop, Presto, Spark, etc.
