ABOUT THIS FEATURED OPPORTUNITY
The Data Engineer will play a critical role in providing machine learning engineers with access to high-quality training data, optimizing data processing workflows, and improving scalability for a toolkit used by business teams to configure their chatbots.
THE OPPORTUNITY FOR YOU
- Collaborate with machine learning engineers (MLEs) to understand their data requirements and ensure access to high-quality training data.
- Identify and address data quality issues, including system messages, foreign-language content, and other anomalies.
- Develop and implement efficient data processing pipelines to preprocess and clean large datasets.
- Optimize data processing workflows to reduce processing time and improve scalability.
- Work closely with MLEs to parallelize and optimize their code, ensuring it is suitable for large-scale data processing.
- Design and implement data views, filters, and other tools to facilitate data exploration and troubleshooting.
- Implement and manage data pipelines using Airflow to automate and orchestrate data workflows.
The ideal candidate will stay current with advancements in data engineering, including distributed computing for ML data processing, and incorporate relevant technologies into our infrastructure.
KEY SUCCESS FACTORS
- 4+ years of Data Engineering experience, including proven experience with large batch processing jobs and optimizing data processing workflows for scalability
- Strong proficiency in Python programming
- Expertise in big data tools such as Spark/PySpark or Dask
NICE TO HAVES
- Familiarity with MongoDB for efficient data storage and retrieval
- Experience with Apache Airflow for orchestrating complex data pipelines
- Experience in the LLM or text data space, with a focus on distributed computing for ML data processing (not optimizing training models)
#LI-MM1