The other day, a very senior executive of a prominent body that honors excellence in major Internet-based media was talking about chatbots, AI, and other exciting technological developments, citing examples of how online streaming services like Spotify, powered by smart algorithms, are bringing an unprecedented level of personalization to the services sector. The presentation was mesmerizing, not only for its content but also for one particular comment: “We no longer need PhDs to spend hours writing codes because machine learning is there to take care of it”.
This is not the first time in the recent past that someone has made such a comment, and it just goes to show how thick the fog surrounding some of these increasingly used buzzwords is! The chicken-or-egg causality dilemma is also apparent, as the path leading up to AI is not well understood outside the data science community. AI derived through machine learning is a team effort at its best. It starts with the application of specialist statistical knowledge to big datasets that have been cleaned, warehoused in databases and made accessible through the cloud, enabling analyses and simulations that help to understand the past and predict the future, with the ultimate goal of creating qualified use cases. Based on the use case in question, specific machine learning algorithms take over the process of automation and start detecting patterns and driving outcomes as more data pours in, thereby leading to AI.
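As a purely illustrative sketch of the path described above (clean the data, fit a statistical model, then refit as more data pours in), consider the toy Python example below. The data, the function names and the choice of a simple least-squares line are all assumptions made for illustration; real projects involve far larger datasets, proper infrastructure and a whole team.

```python
from statistics import mean

def clean(records):
    """Drop records in which either value is missing (None)."""
    return [(x, y) for x, y in records if x is not None and y is not None]

def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def predict(model, x):
    """Apply a fitted (slope, intercept) pair to a new x."""
    slope, intercept = model
    return slope * x + intercept

# Hypothetical raw data with gaps, standing in for a real dataset.
raw = [(1, 2.0), (2, 4.0), (None, 5.0), (3, 6.0), (4, None)]

history = clean(raw)        # step 1: cleaning
model = fit_line(history)   # step 2: statistics on the past

# Step 3: more data pours in, and the model is refitted automatically.
history += clean([(5, 10.0), (6, None)])
model = fit_line(history)

print(predict(model, 7))    # step 4: a prediction about the future
```

Even in this toy version, someone had to decide what counts as a missing value, which model suits the question and when to refit, which is exactly the work the quoted comment overlooks.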
As data undergoes a serious makeover, from being invaluable supporting information produced and exploited almost exclusively by academic and scientific institutions to becoming a product in itself, marketed and sold, a lot of people are grappling with the definition, specificity and interchangeability of terms like data science, data analytics, big data, predictive modeling, machine learning, business intelligence, data architecture, data warehousing, etc.
Most job postings in Data Science that seek candidates to do everything with data, other than classical BI and Data Engineering, are advertised under the title ‘Data Scientist’. But what is a Data Scientist?
For a start, Data Science could include specialties like applied statistics, data analytics, data simulation and predictive modeling, and machine learning, among others. However, the definition is still blurred and constantly evolving. ‘Data Scientist’ is certainly a very generic term, in the same way that ‘Computer Scientist’ is.
Data Scientists, based on specialization, could be branded as:
1/ Data strategists (good business domain knowledge for setting up use cases; data wisdom; able to translate business needs into data science projects; capable of identifying and filling data gaps; not a product or ML engineer; limited technical expertise in ML/data engineering)
2/ Data Analysts/Business Analysts (derive insights from large datasets using applied statistics and predictive modeling; strong business acumen; not a product person, not an engineer)
3/ Data enablers/Data science engineers (capable of automating data extraction from DBs/data lakes; expert in feature engineering and in preparing and manipulating data for deriving insights using advanced analytics and ML; limited business acumen; not a product person)
4/ Machine Learning Experts/Engineers/Developers (an ML expert, able to extract data from a DB/data lake, clean it, and run ML techniques to derive insights; transforms insights into products and advises on product development; little data wisdom; little business acumen)
Despite this distinction based on hardcore technical skills, there are of course areas of overlap, which enable a specialist in one area to understand the nuances of another at a broader level. This is further enhanced by a person’s interests, exposure, experience and other factors that go on to define capabilities, which are unique to each individual. The branding part, however, has to evolve in consensus with the data science community, technological advancements and corresponding business needs.
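The overlap between the four profiles can be made concrete with a small Python sketch. The skill tags below are a loose paraphrase of the descriptions in the list above, not an agreed taxonomy, and the `overlaps` helper is hypothetical, introduced only for illustration.

```python
from itertools import combinations

# Skill tags paraphrased from the four profiles; an assumption, not a standard.
PROFILES = {
    "Data strategist": {
        "business knowledge", "data wisdom", "use-case definition",
    },
    "Data/business analyst": {
        "business knowledge", "applied statistics", "predictive modeling",
    },
    "Data enabler": {
        "data extraction", "feature engineering",
        "data preparation", "machine learning",
    },
    "ML expert": {
        "data extraction", "data preparation",
        "machine learning", "product advice",
    },
}

def overlaps(profiles):
    """Return the shared skills for every pair of profiles that has any."""
    result = {}
    for (a, skills_a), (b, skills_b) in combinations(profiles.items(), 2):
        common = skills_a & skills_b
        if common:
            result[(a, b)] = common
    return result

shared = overlaps(PROFILES)
for (a, b), skills in shared.items():
    print(f"{a} / {b}: {sorted(skills)}")
```

With these tags, the strategist and the analyst meet on business knowledge, while the enabler and the ML expert meet on the data handling and ML side; a different choice of tags would of course draw the overlaps differently.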
With the transition of data to a product of direct commercial value, or a compelling by-product for developing the next generation of products, a clear differentiation has arisen between Data Science and Data Engineering in the past couple of years. Needless to say, any data science team needs to include, or collaborate very closely with, one or more data engineers.
Data Engineering can now be broadly divided into specializations like:
1/ Data-warehouse/BI specialist (classical/dimensional data warehousing, relational databases, data governance, ETL processes, reporting, among others)
2/ Architect/Infra/“Big Data” Data Engineer (data infrastructure and architecture, relational and non-relational databases, big data ecology at the core)
3/ Product data engineer (DevOps, continuous integration and deployment, big data ecology are key skills)
For employers to fully exploit the various skill sets out there and adapt them to their business needs, it is imperative that they start off with a clear vision and business strategy around data or data-based products. This is the key to successfully hiring the right talent, managing expectations on both sides, and retaining that talent, while working towards a great partnership between data scientists and other business stakeholders.
On the other hand, as a community of Data Scientists, it is also up to us to ask the right questions, whether at an interview or at project conceptualization, in order to understand and advise on the skills and resources that will be needed to build a team or to work as part of an existing one, based on the direction and expected outcome.
In practice, especially at the countless Internet startups whose business model and value proposition are centered on data, the actual job of a data scientist could very well begin with identifying relevant data sources and obtaining adequate datasets to get started. This could also involve setting up primary and secondary market research on key targets, which certainly requires profound segment, market and business knowledge. Even though it is perfectly feasible that a few unicorn Data Scientists exist who can cover everything from business to data science, including software development and deployment, finding and recruiting them would be as easy as locating a needle in a haystack.
Even though it might seem to make good financial sense for a business to seek and hire a few unicorns (Data Scientists) to turn data into gold, it is an expectation that neither side will live up to. We must put an end, through engagement and interaction, to the illusion that all highly technical faculties can be found in one person. To this end, the nonsensical Venn diagrams that continuously feed the fantasy, by trying to identify target profiles sitting at the intersection of various skill sets, have to go. Data people have T-shaped (maybe π-shaped for some) profiles with respect to skill sets: a broad range of skills (mathematics/statistics, database/software/product engineering, computing, domain knowledge, technical project management and communication) along with one or more deep specializations.
To help navigate the Data Science ecology, a skills matrix is presented below. It may be useful, but it is still far from complete. Input and feedback will be extremely valuable for building on it.