
Job Title


Data Version Control in Analytics DevOps Paradigm


Company : DVC


Location : Adelaide, South Australia


Created : 2024-05-04


Job Type : Full Time


Job Description

The eternal dream of almost every Data Scientist is to spend all their time exploring new datasets, engineering new features, and inventing and validating cool new algorithms and strategies. However, the daily routines of a Data Scientist also include raw data pre-processing, dealing with infrastructure, and bringing models to production. That's where good DevOps practices and skills are essential, and they will certainly be beneficial for industrial Data Scientists, as they allow these challenges to be addressed in a self-service manner.

The primary mission of DevOps is to help teams resolve various tech-ops, infrastructure, tooling, and pipeline issues. On the other hand, as mentioned in a conceptual review by Forbes in November 2016, industrial analytics is no longer going to be driven by data scientists alone. It requires an investment in DevOps skills, practices, and supporting technology to move analytics out of the lab and into the business. There are even voices calling on Data Scientists to concentrate on agile methodology and DevOps if they would like to retain their jobs in business in the long run.

Why DevOps Matters

The eternal dream of almost every Data Scientist today is to spend all (well, almost all) the time in the office exploring new datasets, engineering decisive new features, and inventing and validating cool new algorithms and strategies. However, reality is often different. One of the unfortunate daily routines of a Data Scientist's work is raw data pre-processing.
It usually translates into challenges like these:

- Pull all kinds of necessary data from a variety of sources
  - Internal data sources like ERP, CRM, and POS systems, or data from online e-commerce platforms
- Extract, transform, and load the data
  - Relate and join the data sources
  - Aggregate and transform the data
  - Avoid technical and performance drawbacks when everything ends up in one big table at the end
- Facilitate continuous machine learning and decision-making in a business-ready framework
  - Utilize historic data to train the machine learning models and algorithms
  - Use the current, up-to-date data for decision-making
  - Export the resulting decisions/recommendations back for review by business stakeholders, either into the ERP system or some other data warehouse

Another big challenge is to organize collaboration and data/model sharing inside and across the boundaries of teams of Data Scientists and Software Engineers.

DevOps skills, as well as effective instruments, will certainly be beneficial for industrial Data Scientists, as they can address the above-mentioned challenges in a self-service manner.

Can DVC Be a Solution?

Data Version Control, or simply DVC, comes to the scene whenever you start looking for effective DevOps-for-Analytics instruments. DVC is an open source tool for data science projects. It makes your data science projects reproducible by automatically building a data dependency graph (DAG). Your code and its dependencies can be easily shared through Git, and your data through cloud storage (AWS S3, GCP), in a single DVC environment.

Although DVC was originally created for machine learning developers and data scientists, it has proven useful beyond that audience. Since it brings proven engineering practices to a not-well-defined ML process, I discovered it to have enormous potential as an Analytical DevOps instrument.

It clearly helps to manage a big fraction of the DevOps issues in daily Data Scientist routines:

- Pull all kinds of necessary data from a variety of sources.
Once you configure and script your data extraction jobs with DVC, they will be persistent and operable across your data and service infrastructure.
- Extract, transform, and load the data. ETL is going to be easy and repeatable once you configure it with DVC scripting. It will become a solid pipeline to operate without major supportive effort. Moreover, it will track all changes and trigger an alert for updates in the pipeline steps via the DAG.
- Facilitate continuous machine learning and decision-making. Part of the pipeline facilitated through DVC scripting can be jobs that upload data back to any transactional system (like ERP, ERM, CRM, etc.), warehouse, or data mart. It will then be exposed to business stakeholders to make intelligent data-driven decisions.
- Share your algorithms and data. Machine learning modeling is an iterative process, and it is extremely important to keep track of your steps, the dependencies between the steps, the dependencies between your code and data files, and all code running arguments. This becomes even more important and complicated in a team environment, where data scientist collaboration takes a serious amount of the team's effort. DVC will be the arm to help you with it.

One of the juicy features of DVC is its ability to support multiple technology stacks. Whether you prefer R or use promising Python-based implementations for your industrial data products, DVC will be able to support your pipeline properly. You can see it in action for both Python-based and R-based technical stacks.

As such, DVC is going to be one of the tools you would enjoy using if/when you embark on building a continual analytical environment for your system or across your organization.

Continual Analytical Environment and DevOps

Building a production pipeline is quite different from building a machine-learning prototype on a local laptop.
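The dependency-graph idea described above is what makes such pipelines reproducible: each stage records checksums of its inputs, and a stage is re-run only when a checksum no longer matches. The following is a toy Python sketch of that mechanism, not DVC's actual implementation; all names (`run_stage`, `pipeline_state.json`) are illustrative.

```python
import hashlib
import json
from pathlib import Path

def checksum(path):
    """MD5 of a file's contents -- a stand-in for DVC's cache keys."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()

def run_stage(name, deps, outs, command, state_file="pipeline_state.json"):
    """Run `command` only if a dependency changed since the last run."""
    p = Path(state_file)
    state = json.loads(p.read_text()) if p.exists() else {}
    current = {d: checksum(d) for d in deps}
    if state.get(name) == current and all(Path(o).exists() for o in outs):
        print(f"stage '{name}' is up to date, skipping")
        return False
    command()
    state[name] = current
    p.write_text(json.dumps(state))
    print(f"stage '{name}' re-ran")
    return True

# Toy two-stage ETL pipeline: extract -> transform
Path("raw.csv").write_text("1,2\n3,4\n")

def extract():
    Path("clean.csv").write_text(Path("raw.csv").read_text().strip())

def transform():
    rows = Path("clean.csv").read_text().splitlines()
    total = sum(int(x) for row in rows for x in row.split(","))
    Path("report.txt").write_text(str(total))

run_stage("extract", deps=["raw.csv"], outs=["clean.csv"], command=extract)
run_stage("transform", deps=["clean.csv"], outs=["report.txt"], command=transform)
```

Running the script a second time finds every stage up to date and skips it; touching `raw.csv` with new content would invalidate both stages in order, which is exactly the behavior a DAG-driven tool gives you for free.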
Many teams and companies face challenges there.

At a bare minimum, the following requirements shall be met when you move your solution into production:

- Periodic re-training of the models/algorithms
- Ease of re-deployment and configuration changes in the system
- Efficiency and high performance in real-time scoring of new out-of-sample observations
- Ability to monitor model performance over time
- Adaptive ETL and the ability to manage new data feeds and transactional systems as data sources for AI and machine learning tools
- Scaling to really big data operations
- Security and authorized access levels to different areas of the analytical systems
- Solid backup and recovery processes/tools

This goes into territory traditionally inhabited by DevOps. Data Scientists should ideally learn to handle part of those requirements themselves, or at least be informative consultants to classical DevOps gurus.

DVC can help in many aspects of the production scenario above, as it can orchestrate relevant tools and instruments through its scripting. In such a setup, DVC scripts will be a sharable manifestation (and implementation) of your production pipeline, where each step can be transparently reviewed, easily maintained, and changed as needed over time.

Will DevOps Be Captivating?

If you are further interested in understanding the ever-proliferating role of DevOps in modern Data Science and predictive analytics in business, there are good resources for your review.

By any means, DVC is going to be a useful instrument to fill the multiple gaps between classical in-lab, old-school data science practices and the growing demands of business to build solid DevOps processes and workflows that streamline mature and persistent data analytics.
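As a closing sketch of the split this article keeps returning to — small metafiles versioned in Git while bulky data lives in a cache pushed to cloud storage — here is a toy content-addressed store in Python. The file layout and names (`cache/`, `.meta`) are illustrative assumptions, not DVC's actual on-disk format.

```python
import hashlib
import json
import shutil
from pathlib import Path

CACHE = Path("cache")  # stand-in for a remote such as an S3 bucket

def add(data_path):
    """Copy a data file into a content-addressed cache and emit a tiny
    metafile. The metafile is what you would commit to Git; the cache
    directory is what you would push to cloud storage."""
    data = Path(data_path).read_bytes()
    md5 = hashlib.md5(data).hexdigest()
    CACHE.mkdir(exist_ok=True)
    (CACHE / md5).write_bytes(data)
    meta = Path(str(data_path) + ".meta")
    meta.write_text(json.dumps({"md5": md5, "path": str(data_path)}))
    return meta

def restore(meta_path):
    """Given only the Git-tracked metafile, pull the data from the cache."""
    meta = json.loads(Path(meta_path).read_text())
    shutil.copyfile(CACHE / meta["md5"], meta["path"])

Path("model.bin").write_bytes(b"weights")
meta = add("model.bin")     # model.bin.meta goes to Git, cache/ goes to S3
Path("model.bin").unlink()  # a teammate clones the repo without the data...
restore(meta)               # ...and restores it from the shared cache
```

Because the cache is keyed by content hash, two teammates who `add` identical data store it exactly once, and the metafile in Git pins every experiment to the precise data version it used.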