Site Reliability Engineer

NGS Super • Sydney Region • 3w ago

About us

We are an award-winning, national $18B public offer industry fund focused on the education and community sectors. Working for NGS Super means being part of something bigger and working to make a difference to our members and their financial future. Our people are key to our success, and as we expand, we’re committed to finding and retaining the right talent to take on the journey. As well as a flexible and fun workplace, we offer competitive benefits including additional leave entitlements, personal & professional development and health & wellbeing programs.

The Role

Working within the Engineering team, the Site Reliability Engineer (SRE) will play a pivotal role in improving the reliability, performance, and operational resilience of NGS Super’s technology platforms.

This is a hands-on role for someone who enjoys building foundations: you’ll partner closely with engineers and technology stakeholders to define and embed pragmatic SRE practices (SLOs/SLIs, incident management, post-incident reviews, runbooks, automation, and continuous improvement). You’ll help shape how we build and operate services so they are observable, performant, scalable, secure, and resilient.

This role will be embedded in the development lifecycle, contributing to resilient architectures for NGS digital products and platforms. As part of a collaborative, agile team, you’ll be accountable for SLOs and help drive measurable improvements in uptime and MTTD/MTTR.

The role is based in our Sydney CBD office.

What you’ll do

- Establish, maintain, and report on service reliability targets (SLIs/SLOs/error budgets) with technology and product teams.
- Improve observability across services by designing effective metrics, logging, tracing, dashboards, alerting, and actionable signals.
- Lead and refine incident response practices, including triage, escalation, and stakeholder communications.
- Facilitate blameless post-incident reviews and drive remediation actions to completion (root cause analysis and preventative actions).
- Build and maintain operational documentation including runbooks, playbooks, and standard operating procedures.
- Reduce operational toil through automation (e.g., self-service operations, safe deployments, and automated recovery).
- Partner with engineers on reliability-focused architecture, safe releases, and change risk management to improve service quality.
- Improve performance and capacity management by identifying bottlenecks, forecasting demand, and validating scalability assumptions.
- Contribute to security, compliance, and resilience through best-practice governance, incident and root cause analysis, security reviews, and audit support.
- Contribute to disaster recovery planning and testing (DR runbooks, RTO/RPO validation, and regular exercises).
- Participate in on-call support for production incidents, including after-hours coverage when required.
- Collaborate with Engineering, InfoSec, IT Operations, and Data teams to improve operational readiness and continuous DevOps improvement.

About you

We are looking for someone who brings fresh experience and thinking to the superannuation industry to help drive our growth journey. You will bring the following key skills:

- 5 years’ experience in Site Reliability Engineering, DevOps, Platform Engineering, Systems Engineering, or a similar operations-focused engineering role.
- Strong hands-on experience improving production service reliability through incident response, root cause analysis, and proactive remediation.
- Practical experience defining and using SLIs/SLOs/error budgets to guide prioritisation and continuous improvement.
- Hands-on experience with observability tooling and practices (logging, metrics, traces, alerting, dashboards).
- Strong troubleshooting skills across application, infrastructure, and network layers; able to diagnose complex, multi-service incidents under pressure.
- Hands-on experience with cloud platforms and cloud-native architectures (ideally AWS).
- Experience improving deployment safety (CI/CD, infrastructure-as-code, release controls/rollback strategies, and change risk management).
- Ability to write maintainable automation and tooling (scripting and/or general-purpose programming such as Node.js, Python, or JavaScript).
- Strong security fundamentals and experience working in environments with governance and compliance requirements.
- Clear, collaborative communicator who contributes to documentation, continuous improvement, and coaching others.

The following skills are nice to have but not essential:

- Experience working in a regulated environment (e.g., financial services/APRA).
- Experience implementing or maturing an incident management practice (on-call design, escalation policies, post-incident reviews, runbooks/playbooks).
- Experience with serverless and/or container platforms and orchestration.
- Experience with performance engineering and capacity planning.
- Cloud certifications (e.g., AWS).
- Experience building reliability practices in a newly formed or growing engineering team.

If you are willing to take on challenges, problem-solve creatively, and bring fresh ideas to the table, we would love to hear from you.

We are a super fund with an exceptional work culture and a diverse offering for developing our people, and you can be a part of it while earning an attractive remuneration package!

Please note that to be eligible for this role, you are required to have permanent Australian working rights and residency.

We are an equal opportunity employer committed to creating a workplace that values diversity, equity, and respect for all individuals.