
Wednesday, November 1, 2023

Transitioning to Data Science

Ask yourself questions:
Before you consider switching to data science, first ask yourself whether you actually need to, and whether you have the necessary skill set and aptitude. Data science is, at its core, applied mathematics. It is not something glamorous or extremely high-tech, and if you don't love mathematics, you will hate it. Nor is data science a kind of complex computer programming; programming is merely a tool used within it. Since data science is essentially mathematics, if you don't enjoy math, you won't like data science.


If you don't enjoy programming, at least learn the fundamentals. You could try to delegate your programming, but at the very least you need some appreciation for data, and that can only happen when you have some understanding of mathematics. If you think that trigonometry and calculus have no practical applications, you probably aren't meant for data science.

However, if you have reached the point where you have the necessary clarity of thought and are certain that you want to switch to data science, there are two ways to go about it: the traditional method and the unconventional method.


Traditional Method of Transitioning to Data Science:
The traditional route is to enrol in a data science programme. Make sure it is a full-time programme offered by a reputed university. In the Indian context, the Indian Statistical Institute (ISI), IISc Bangalore, and the IITs are great places to look. It must be a full-time course, not something you do over the weekend. Dedicate one or two years to the programme, putting everything else on hold.


Unconventional Method of Transitioning to Data Science:
The unconventional approach is to complete some data science projects and build a portfolio, then use that portfolio to transition: get recommendations from friends, understand the specifics of a given data science job posting, and work to improve accordingly. If you apply to 10–20 data science positions and prepare for each one in a very specific way over the course of, say, five to six months or a year, you will learn what you need to do to be a successful data scientist.

Now the question arises: how do you build a portfolio, and how do you create data science projects from scratch? To do that, identify problems in your life that you care deeply about and turn them into data science projects. Either solve those problems or, at the very least, find an approach and consider what a potential solution might be. If you have identified a data science problem but are stuck somewhere, or if you feel you don't actually have a data science problem to solve, I've made a number of videos explaining how to start data science projects and what sorts of projects fall under the data science umbrella. You may watch those videos.

You can get in touch with me through email, a message on LinkedIn, Instagram, or through visiting my verified Topmate profile: https://topmate.io/ashish_gourav

Friday, October 6, 2023

Artificial Intelligence & Machine Learning's Most Important Concepts for Interviews

Let’s learn the most important concepts of Artificial Intelligence and Machine Learning through alphabets!

ANN
An Artificial Neural Network (ANN) mimics the human brain's network of neurons. The hidden units in an ANN's hidden layers can be thought of as neurons: each combines its inputs with weights and a bias term and applies a non-linear activation.
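A single hidden unit can be sketched in a few lines of plain Python (a toy illustration, not a full network; the function name and numbers are made up for the example):

```python
import math

def neuron(inputs, weights, bias):
    """One hidden unit: weighted sum of inputs plus a bias,
    passed through a non-linear activation (here, the sigmoid)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 / (1 + math.exp(-z))

print(round(neuron([1.0, 2.0], [0.5, -0.25], 0.1), 4))  # → 0.525
```

Stacking many such units in layers, and learning the weights and biases from data, is what gives a neural network its power.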


Bagging

Bagging (bootstrap aggregating) means creating multiple training sets by resampling the original training set with replacement, training a separate model on each, and finally averaging their predictions.
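A toy sketch of the idea in plain Python, using the simplest possible base model, one that just predicts the mean of its own bootstrap sample (all names and numbers here are illustrative):

```python
import random
import statistics

def bagged_mean_prediction(data, n_models=100, seed=42):
    """Bagging with a trivial base model: each model predicts the mean
    of its own bootstrap sample; the bagged prediction averages them."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_models):
        # Draw a bootstrap sample: same size, sampled with replacement
        sample = rng.choices(data, k=len(data))
        predictions.append(statistics.mean(sample))  # "train" one model
    return statistics.mean(predictions)  # average the models

data = [3.1, 2.9, 3.4, 2.8, 3.3, 3.0]
print(round(bagged_mean_prediction(data), 2))  # close to the sample mean
```

With real base models (e.g. decision trees), the averaging step reduces variance without increasing bias much, which is the whole point of bagging.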


Correlation
Correlation measures the strength and direction of the (linear) relationship between two features or variables.
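A minimal sketch of the Pearson correlation coefficient in plain Python (the function name and sample data are illustrative):

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation: covariance of x and y divided by the
    product of their standard deviations. Ranges from -1 to +1."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

# A perfectly linear relationship gives a correlation of 1.0
print(round(pearson_correlation([1, 2, 3, 4], [2, 4, 6, 8]), 6))  # → 1.0
```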


Deep Learning
The cornerstone of deep learning is the artificial neural network. Whenever an artificial neural network with many hidden layers is in use, it is most likely a deep learning problem being solved.


Error
In a supervised machine learning setup, you always have an actual value (the one supervising the model) and a predicted or estimated value; the difference between the two is the error. In supervised machine learning problems, you aim to minimise this error.


Feature
Feature, variable, input: all of these mean what you feed into your data science models.

 

Gradient Descent
Gradient descent is an iterative method for minimising a function: at each step, you move in the direction opposite to the gradient. It converges to a local minimum, which may or may not be the global minimum (for convex functions, it is).
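A minimal sketch of the iteration in plain Python (the function, starting point, and step size are illustrative):

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=200):
    """Iteratively step against the gradient to approach a minimum."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * grad(x)
    return x

# Minimise f(x) = (x - 3)^2, whose gradient is 2(x - 3); minimum at x = 3
minimum = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(round(minimum, 4))  # → 3.0
```

The learning rate matters: too large and the iteration diverges, too small and convergence is slow.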


Hypothesis Testing
Using a test statistic, you either reject the null hypothesis in favour of the alternative or fail to reject the null.


Intercept
The intercept in linear regression is the parameter whose contribution is the same for all inputs. For example, if your intercept is 3, its contribution to y is always 3, irrespective of x.


Julia
Julia is the next breakthrough in data science programming, as it is specially designed for quantitative researchers or data scientists.


KNN
K-Nearest Neighbours (KNN) is a method that can be adapted to both regression and classification. The key idea is to find the K nearest points to a given data point and use them to make a prediction: a majority vote for classification, an average for regression. Importantly, this is a non-parametric model, and K is the hyper-parameter.
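A toy KNN classifier in plain Python (the data and function name are illustrative):

```python
from collections import Counter

def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest
    training points (Euclidean distance on coordinate tuples)."""
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
    neighbours = sorted(train, key=lambda pt: dist(pt[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]
print(knn_classify(train, (2, 2)))  # → A
```

Note that there is no training step at all: the "model" is simply the stored data, which is what non-parametric means here.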


Linear Regression
Linear Regression is the simplest, yet the most powerful predictive ML Model. Its components are the intercept and coefficients of all the features or predictors. 
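With one feature, the intercept and coefficient have a closed-form least-squares solution; this plain-Python sketch (names and data are illustrative) recovers both:

```python
def fit_simple_linear_regression(xs, ys):
    """Ordinary least squares for one feature:
    slope = cov(x, y) / var(x); intercept = mean(y) - slope * mean(x)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Data generated exactly by y = 3 + 2x is recovered exactly
intercept, slope = fit_simple_linear_regression([0, 1, 2, 3], [3, 5, 7, 9])
print(intercept, slope)  # → 3.0 2.0
```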


Model
Model is an abstraction or simplification of reality using mathematics and statistics.


Normal Distribution
Newcomers to data science tend to talk only about the normal distribution, but it is not the only statistical distribution of importance; there are many others. A normal distribution is completely identified by its mean and variance. The heuristic for recognising one is that it is bell-shaped and symmetric about its mean.
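As a quick illustration using only Python's standard library (the parameters and sample size are arbitrary), we can draw from a normal distribution and recover the two parameters that identify it:

```python
import random
import statistics

# Draw a large sample from N(mean=10, std=2); fixed seed for reproducibility
random.seed(0)
sample = [random.gauss(10, 2) for _ in range(100_000)]

print(round(statistics.mean(sample), 1))   # close to 10
print(round(statistics.stdev(sample), 1))  # close to 2
```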


Overfitting
Overfitting tends to happen when a model is too complex: it fits the training data well but fails miserably when presented with new data.


P-Value
The p-value measures the strength of the evidence against the null hypothesis. In statistical terms, it is the observed level of significance.


Q-Q Plot
A Q-Q plot compares two probability distributions by plotting their quantiles against each other. It is commonly used as a visual test for normality.
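The points of a normal Q-Q plot can be computed numerically with just the standard library's `statistics.NormalDist` (a sketch; the sample data are made up):

```python
import statistics

def qq_points(sample):
    """Pair each sorted sample value with the matching quantile of a
    normal distribution fitted to the sample's mean and stdev. Points
    close to the line y = x suggest the sample is roughly normal."""
    n = len(sample)
    dist = statistics.NormalDist(statistics.mean(sample),
                                 statistics.stdev(sample))
    theoretical = [dist.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(sorted(sample), theoretical))

pts = qq_points([4.9, 5.1, 5.0, 5.2, 4.8, 5.0, 5.1, 4.9])
for observed, expected in pts:
    print(round(observed, 2), round(expected, 2))
```

In practice you would plot these pairs (e.g. with `scipy.stats.probplot` or matplotlib) rather than print them.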


Random Forest
Random forest combines a number of decision trees, similar to bagging, with a twist: each tree considers only a random subset of the features when making its splits.


SVD
Singular Value Decomposition (SVD), like PCA, is a dimensionality reduction technique: feed in p features, where p is very large, and it gives you a smaller set of transformed features. In PCA terms, these are the most important principal components, which help you reduce the dimensionality of the problem.


T Test
A t-test is a statistical test for comparing the means of two different distributions.
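The pooled two-sample t statistic can be computed directly from its formula; this plain-Python sketch (the helper name and data are illustrative) shows clearly separated groups producing a large |t|:

```python
import math
import statistics

def two_sample_t(sample_a, sample_b):
    """Pooled two-sample t statistic: how many standard errors apart
    the two sample means are. A large |t| suggests different means."""
    na, nb = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    pooled = ((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2)
    return (mean_a - mean_b) / math.sqrt(pooled * (1 / na + 1 / nb))

# Clearly separated groups give a strongly negative t statistic
t = two_sample_t([10.1, 9.8, 10.2, 9.9], [15.0, 14.8, 15.3, 15.1])
print(round(t, 2))
```

In practice you would use `scipy.stats.ttest_ind`, which also returns the p-value for the statistic.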


Underfitting
Underfitting is the opposite of overfitting: a deficient model that missed important patterns in the data during training.


Variance
Variance measures how spread out the data is about its mean. A data set like 2, 2, 2, 2, … has very low variance, in this case zero, while a data set like 2, 5, 10, 15, 30 has a much higher, definitely non-zero variance.
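The two toy data sets above can be checked directly with Python's standard library:

```python
import statistics

data_constant = [2, 2, 2, 2]
data_spread = [2, 5, 10, 15, 30]

# A constant data set has zero spread, hence zero variance
print(statistics.pvariance(data_constant))

# The spread-out data set has population variance 97.04
print(statistics.pvariance(data_spread))
```

(`pvariance` is the population variance, dividing by n; `variance` is the sample variance, dividing by n - 1.)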


Web Scraping
Web scraping is a legally risky way to get data from websites. Hence, read up on the possible violations (such as a site's terms of use) before scraping.


X (Inputs)
X is what you know.


Y (Outputs)
Y is what you need to know.


Z-Score
The z-score measures how far a data point is from the mean, in units of standard deviation: subtract the mean from the data point and divide by the standard deviation.
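The recipe above translates directly into a couple of lines of Python (the data are illustrative):

```python
import statistics

def z_score(x, data):
    """How many standard deviations x lies from the mean of data."""
    return (x - statistics.mean(data)) / statistics.stdev(data)

marks = [50, 60, 70, 80, 90]
print(z_score(70, marks))            # the mean itself → 0.0
print(round(z_score(90, marks), 2))  # → 1.26
```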


Tuesday, June 27, 2023

Can Economists become Data Scientists?

It is a well-known fact that the key role of an economist is to optimise costs in order to maximise payoffs. Data scientists perform a relatively similar job: they optimise cost functions to maximise model fit. The two are highly interlinked job profiles, and the tools that economists and data scientists use are largely the same. Perhaps economists, not engineers, are the best suited to become data scientists.

The claim above, that economists optimise payoffs and minimise costs, is admittedly an overt generalisation and may not always hold. But it is the general theme of economics, because economics is built on resource constraints: resources are scarce, so the use of available resources needs to be optimised.

The same happens with data scientists. Data scientists rarely have a lot of data, and their clients and problems certainly don't have infinite resources. Those resources need to be managed so that they are used optimally. There are plenty of data science problems in which resources must be optimised and payoffs maximised.

Generally, economists use data science as one of their tools and techniques, whereas data scientists are all technique. If you are a data scientist without domain expertise, you are a kind of technician. But if you have domain expertise, say as a healthcare data scientist, you are in high demand for that specific skill set. So an economist and a domain-specific data scientist have a lot in common.

Coming to tools: in economics there is a field called econometrics, which essentially comprises a lot of regression. If you begin any data science or machine learning course, they will start by teaching you regression (simple and multiple), correlation, and what happens when your data deviates from the assumptions of regression. These are the specific things discussed in detail in econometrics, which is perhaps the most powerful tool of economics. Economics uses statistics, maths, econometrics, and a lot of other analytical tools, and so does the data science industry. Therefore, economists are well suited to become data scientists.

Are software engineers better suited to data science than product managers or MBAs? Software engineers have an extremely difficult task at hand, and they are quite good at it. But just because data science has a bit of coding in its ambit, you cannot say that software engineers are better suited to become data scientists. Software engineers come in a lot of shapes and forms, and there is a specific kind who specialises in the artificial intelligence and machine learning domain; they make invaluable data scientists. But you cannot say that all software engineers are meant to be data scientists; only a subset of software engineering is connected to data science. Economics, loosely, is all about analytical tools and using models to describe the ideas, theories, and philosophies that ultimately run the trade of goods and services. Therefore, it can be said that an economist will be a traditional data scientist, while an AI/ML software engineer will be a kind of maverick who brings cutting-edge innovation to data science; both are needed.

So, if you still wonder whether studying economics can make you a data analyst or a data scientist: the answer is yes. The only caveat is that if you are beginning your economics education, don't think of it merely as a stepping stone to a data science career. If you want to study data science, there are now specialised courses in data science. At present, many practitioners in data science were never formally trained in it; they have simply become data scientists. But in the future, data scientists are going to be formally trained, so if you really believe this is what you want to do with your life, you can find plenty of resources for it. If, instead, you want to become an economist, then study economics; it is a pretty interesting field of study with a lot of value. The predictions economists make are just one part of the value pyramid; their models have a high degree of explanatory power and interpretability, which artificial neural networks and other ML models, being black-box models, don't have. This is the main reason economists bring a lot of value to data science.

So don’t be scared of switching from economics to data science, because they are really close siblings.

Monday, April 24, 2023

How I became a Data Scientist at Big 4

Engineering Days

I began my quest to become a data scientist long before it was promoted as the hottest job of the century; I had no idea I was on the path to becoming one. It all started in my second year of B.Tech., when I began to consider which career option to pursue next. I was fairly certain that I would not continue my engineering studies. Then Actuarial Science was suggested to me, and I realised it might be an interesting career. But I abandoned the notion because it required years of writing countless exams. At that point in my engineering, I thought I was done with my education and that it was time to start my profession and begin making money. How wrong I was!


CFA Level 1

I started looking for jobs that required maths. It was always in the back of my mind that if Actuarial Science was necessary for the job, I could pursue it. Later, I realised that finance involves a lot of the mathematics I enjoy, so I cleared CFA Level 1. It provided me with sufficient finance knowledge, and since I was not applying that knowledge in my everyday job, I saw no point in pursuing further levels.

ISI and CMI

I was thinking about the ISI MSQE programme at the same time. This is a Maths & Economics course, and after passing the CFA Level 1 exam, I learned that Economics is the science of Finance. 

I arrived at ISI for the MSQE programme after a lengthy delay. I studied quantitative economics, and this was one of the few correct steps towards becoming a data scientist, though it is not a straight path.

I had also qualified for the CMI data science course when I joined ISI. But I chose ISI Kolkata over CMI because I wanted to study at ISI. Another logical justification was that MSQE offers courses in maths, economics, statistics, and finance, all of which are related to data science.

Becoming a Data Scientist

I gave numerous internship interviews at ISI before receiving a data science internship at a gaming company. I then started to apply for jobs, and eventually got a data science position at a consulting firm. Although I wouldn't have described myself as an experienced data scientist at that point, I was always crunching numbers in some capacity. Any kind of data analysis, be it in finance, economics, or hardcore data science, can be referred to as data science, and data was involved in my work. Whether you are a data scientist is determined by the work you perform in the course of your employment, not by your job title.

I therefore became a Data Scientist after receiving education and training in Engineering and Economics, and this is what happens for many other people as well. In most cases, they are not trained to be Data Scientists; instead, they become one.

Wednesday, February 8, 2023

Data Engineer vs Data Analyst vs Data Scientist - A Practical Comparison & Perspective!

There are various data roles: Data Scientist, Data Engineer, Data Analyst, etc. Understanding the differences among these roles is important so that you can select the most appropriate path for your journey. Below is a brief description of, and comparison among, the three roles: Data Engineer, Data Scientist, and Data Analyst.

Data Engineer: A data engineer is the person tasked with acquiring and handling data. Data engineers are well acquainted with coding and algorithms, which they use for data cleaning and data handling.

In current times, data is considered a valuable asset, so data engineers are responsible for maintaining the entire data architecture and data pipeline for an organisation. They handle raw & unstructured data and convert it into a usable format so that the data can be made available for further analysis to Data Scientists and Data Analysts. 

Data Engineers are not directly involved in the decision-making of a business. They work as a backend for the entire data team and indirectly help in data-driven decisions. Tools like SQL, MongoDB, Python, etc. are used by Data Engineers.

Data Analyst: Data Analyst is the person next in the pipeline of a data science project. They receive data from data engineers and perform analysis like EDA or any kind of elementary analysis. They analyse the structured data to find useful insights.

Data analysts use descriptive and inferential statistics for data analysis. Finding KPIs, and preparing reports are some of the day-to-day work of data analysts. They understand the current situation of an organisation and suggest recommendations for improvement.

The work of data analysts impacts the business directly. They suggest basic data-driven solutions that might be valuable for an organisation. 

Data analysts use spreadsheet tools like Excel and Google Sheets, and dashboarding tools like R Shiny, Power BI, and Tableau. Sometimes programming languages like Python are also used for data analytics, and SQL is widely used as well.

Data Scientist: Data scientists are the key assets for all the data-related activities in an organisation. They are responsible for all the model development & deployment, checking the performance of models in production, and enhancing the existing model. 

Data scientists possess the knowledge and understanding of Maths & Statistics, Programming Languages, and Machine Learning, as their task requires the use of all these three components. 

Data scientists handle semi-structured or structured data and perform data preprocessing. Various ML models are applied according to the requirements and problem statement. They maintain the accuracy & performance of models that are part of a data science project.       

The work of data scientists is helpful for businesses in predicting future events. Data scientists are directly involved in business decision making.

Data scientists extensively use programming languages like Python, R or SAS. (recently, Julia!)

Data scientists also perform elementary tasks like EDA, which are also done by Data Analysts. 

Sometimes the roles of Data Analyst and Data Scientist are overlapping. In many firms, the Data Analyst can do the work of Data Scientists and vice versa. It depends on the company & its requirements.

But this is not the case for data engineers. Data engineers can be thought of as software engineers whose tasks are quite different from the other two roles.

So this is all about the roles and responsibilities of Data Scientists, Data Analysts, and Data Engineers. Hopefully, you now have a clear understanding of each role.

Saturday, January 21, 2023

How to Learn Data Science Smartly in 2023

Data science is considered the fastest-growing field in current times. Many professionals & students are currently interested in transitioning to this domain. However, learning and moving into a new profession is challenging. It requires structured steps and a solid plan to efficiently crack the domain. So, here we have presented a detailed roadmap that will help you accomplish your goal.

Step 1: Start with a spreadsheet tool like Excel or Google Sheets. Carry out data manipulation, draw graphs, and try to find insights from any dataset of your choice.

Step 2: Move to a programming language, be it R or Python. The task is to perform the same analysis in R or Python that you did in the spreadsheet tool. You'll come across libraries like dplyr and ggplot2 in R, and NumPy and pandas in Python. These libraries will help you with data analysis.

To understand and master these libraries for data analysis, look over the internet, and you'll find a lot of tutorials for the same. Pick any one or at maximum two resources and start learning & implementing. In this process, you'll learn programming language as well as data analysis.

The best way to get the most out of the above two steps is to always ask a lot of questions of the data, then try to discover the answers with the help of Excel, R, and Python. This way, you'll not only learn the tools but also develop analytical thinking.

Step 3: Now start studying statistics. Focus on topics like conditional probability and Bayes' theorem, then move on to probability distributions, hypothesis testing, and statistical tests. The trick to mastering statistics is to first grasp the basic ideas across topics and then start solving numerical problems. Then implement what you learn (probability distributions, hypothesis testing, statistical tests) in Excel and in a programming language.
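As a small exercise in implementing these ideas, Bayes' theorem for a diagnostic-test example can be coded in a few lines of plain Python (the function name and numbers are illustrative, chosen to match the classic textbook example):

```python
def bayes_posterior(prior, sensitivity, false_positive_rate):
    """Bayes' theorem: P(D | +) = P(+ | D) P(D) / P(+), where the
    evidence P(+) sums over having and not having the condition."""
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence

# A rare condition (1% prior) with a 95%-sensitive test
# and a 5% false-positive rate
posterior = bayes_posterior(prior=0.01, sensitivity=0.95,
                            false_positive_rate=0.05)
print(round(posterior, 3))  # → 0.161
```

The counter-intuitive result, a positive test still leaves the probability of the condition at only about 16%, is exactly the kind of insight this step of the roadmap is meant to build.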

Congratulations! You have completed 50% of the journey and are ahead of most beginner aspiring data scientists.

Step 4: Now comes the machine learning part. You'll come across jargon like supervised and unsupervised learning, EDA, data preprocessing, and so on. Don't get disheartened so easily. Start exploring why the topics are classified this way and what steps should be followed. Initially, don't try to understand everything; try to get an idea of the bare essentials. There are plenty of good resources out there for machine learning; stick to a few of them.

Reaching this stage will roughly take anywhere from 3-4 months to a year. Now you are ready to work on quality, end-to-end projects. You can apply for internships or even jobs, and if you want to study further, you can pursue higher studies at a good institute for data science.

Never get too hung up on completing topics. Try to understand the why, when, and how of everything you are learning. If you try to finish things on a short timeline, sooner or later you'll face issues with the fundamentals and feel the need to revisit the topics. Hence, learn slowly but consistently.

It is said that "Little strokes fell great oaks"

All the best for your journey.

Sunday, January 15, 2023

Which One is Better for Data Science : IIT or ISI | Indian Statistical Institute Vs Indian Institute of Technology

Data Science, termed the sexiest job of the 21st century, has gained a lot of traction and eyeballs in the last few years. More and more people are trying to enter the field, and to meet this demand, various institutes have started offering programs in Data Science, Machine Learning, and Artificial Intelligence. Two such premier institutes are IIT and ISI, known for their elite pedigree. But which one should you prefer over the other? There are multiple deciding factors, like the location of the institution, the type of crowd, and exposure. Here we focus on the curriculum and subjects, as that is the most important deciding factor.

Roughly speaking, we can divide the whole Data Science & Machine Learning work into two parts: Tech-focused and Statistics centric.



Tech-focused: This part of Data Science involves dealing with large data sets, finding patterns in data using Machine Learning and AI models, and deploying the models in production. Coding and an understanding of technology are the key skills.

Examples include advanced machine learning and deep learning (e.g., neural networks).

Statistics-centric: This part of Data Science involves more explainable models, where most things depend on the models' parameters. It involves estimating the parameters that explain the complete model. Statistics & Maths are the key areas.

Examples include linear regression, statistical inference, time series forecasting, etc.

Disclaimer: depending on the industry and use case, the two approaches may be chosen individually or combined for a given problem.

Do you enjoy engineering more, or do you want theoretical studies more? That is the one question that should help you decide above anything else.

IITs are known to focus on engineering & technology whereas ISI has primarily a theme of Statistics.

If working in technology and coding appeals to you, then you can prefer IITs for Data Science. IITs are the best place to learn, work and implement technology with the brightest minds. 

If you have an interest in numbers, and you love mathematics and its application in the real world, then ISI is for you. ISI is the best place to understand the Statistics & Maths behind Machine Learning. You get a chance to understand explainable ML, and analysis part of Data Science & how it can help the business.

ISI primarily focuses on traditional statistics & maths. Coding, deployment, and application are the added layers. The situation is the other way round in IITs.

It's not like IITians never learn mathematics or statistics and ISI students never do coding. It's about the curriculum, environment, and culture that differentiates. 

Data Science is always a combination of Statistics, Programming, and Maths. Different institutes have different routes for this journey, giving priority to one section over another. At last, it boils down to your choice & preference.

Still, if you are confused, the good news is that you won't go wrong with either institute, as both the IITs and ISI are wonderful places to become a data scientist.



Monday, August 22, 2022

Easiest Way to Become a Data Scientist

No matter what so-called experts would like to sell you, don't fall for it. No online certification or networking alone is going to get you a Data Science job if you are in a different industry.
So, you may ask: what is the way out?
Yes, there is a smarter way.



Get yourself into a full-time degree. It doesn't have to be specifically in Data Science; you need to find something related to Statistics, Analytics, or Computer Science. It could be an MBA in Business Analytics or a Masters in Statistics.
This will put you on a structured path; your only effort will go into up-skilling yourself in data tools & techniques.

At the end of the course, you will know which companies are looking for you. By narrowing down your search, you make it easier for companies to access talent: a win-win.

Of course, no one size fits all.
I have also talked about this in a YouTube video; check it out below: