
Friday, October 6, 2023

Use ChatGPT to become a 10x Professional

ChatGPT is an advanced language model developed by OpenAI that has gained widespread recognition and adoption across various industries. It utilizes deep learning techniques to generate human-like responses and engage in interactive conversations with users. The capabilities of ChatGPT extend far beyond simple question-answering, making it an invaluable tool for businesses and individuals alike. 

Let us look at the most effective ways to use ChatGPT's superpowers in detail.

Learn the basics of any new topic
Suppose you want to learn about NLP, a topic that is completely alien to you. Simply ask ChatGPT about it and you will get a detailed introduction to NLP.

Search faster
Searching a topic is just a click away. With ChatGPT, you can get results within a fraction of a second.

Code Faster
You can even code faster using ChatGPT. Just write the instructions for the code you require, and ChatGPT will generate it in your language of choice.

Pop cultural references
Along with all these benefits, you can also get pop cultural references from ChatGPT. For example, type “Movies for Gen Z” and you will get a list of the most popular movies that appeal to the Gen Z demographic.

Learn specific aspects of a difficult topic
You can also learn any particular detail of a difficult topic using ChatGPT.


These applications represent just a glimpse into the diverse possibilities offered by ChatGPT's superpowers. As technology continues to advance, the potential for leveraging ChatGPT in innovative and impactful ways will only expand, benefiting individuals, businesses, and society as a whole.

None of these capabilities replaces existing technology or human beings. Rather, they are applications that both humans and other technologies will use. In this way, ChatGPT will enable both technology and human beings to perform more efficiently.

ChatGPT, or any similar technology, will not be an impediment or a disruption to your career progress. It will help you become a smarter person, a smarter professional, and a more efficient individual. I have maintained this stance for the last 10-15 years of my adulthood: technology is always an enabler and a stimulant; it does not simply replace human beings or older technologies. Some people stick with low-tech or older tools because they are easier to use. That said, given enough time everything fades, because everything comes with an expiry date.


Artificial Intelligence & Machine Learning's Most Important Concepts for Interviews

Let’s learn the most important concepts of Artificial Intelligence and Machine Learning through alphabets!

ANN
An Artificial Neural Network (ANN) mimics the neural network of the human brain. The hidden units of a hidden layer can be thought of as neurons: each computes a weighted sum of the inputs plus a bias and applies a non-linear activation.
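
As a toy illustration, a single forward pass through a tiny network can be sketched in plain Python (the weights, biases, and layer sizes here are invented for the example):

```python
def relu(x):
    return max(0.0, x)

def forward(inputs, w_hidden, b_hidden, w_out, b_out):
    # Hidden layer: each "neuron" computes a weighted sum of the inputs
    # plus a bias, then applies a non-linear activation (ReLU here).
    hidden = [relu(sum(w * x for w, x in zip(ws, inputs)) + b)
              for ws, b in zip(w_hidden, b_hidden)]
    # Output layer: linear combination of the hidden activations.
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out

# Two inputs, two hidden neurons, one output; weights chosen arbitrarily.
y = forward([1.0, 2.0],
            w_hidden=[[0.5, -0.2], [0.3, 0.8]],
            b_hidden=[0.1, -0.1],
            w_out=[1.0, 1.0],
            b_out=0.0)
print(y)  # 2.0
```

Training a real network means adjusting these weights and biases to reduce the error; here they are fixed just to show the forward computation.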


Bagging

Bagging (bootstrap aggregating) means creating multiple training sets by resampling the original training set with replacement, training a separate model on each, and finally averaging their predictions.
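
A minimal sketch of the bootstrap-and-average idea, using the mean of each resample as a stand-in for a trained model (the data and the "model" choice are invented for the example):

```python
import random

random.seed(42)

def bootstrap_mean_models(data, n_models=100):
    # Each "model" here is just the mean of one bootstrap resample
    # (sampling with replacement from the original training set).
    return [sum(sample) / len(sample)
            for sample in (random.choices(data, k=len(data))
                           for _ in range(n_models))]

data = [2.0, 4.0, 6.0, 8.0, 10.0]
models = bootstrap_mean_models(data)
# The bagged prediction averages the individual models' outputs.
bagged_prediction = sum(models) / len(models)
print(round(bagged_prediction, 2))  # close to the sample mean, 6.0
```

In practice each bootstrap sample would train a full model such as a decision tree; the averaging step is what reduces variance.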


Correlation
Correlation measures the strength and direction of the relationship between two features or variables.
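
For instance, the Pearson correlation coefficient can be computed by hand (the sample data below are invented for the example):

```python
def pearson(xs, ys):
    # Pearson correlation: covariance divided by the product of
    # the two standard deviations; always lies in [-1, 1].
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]     # perfectly linear in xs
print(pearson(xs, ys))    # 1.0
```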


Deep Learning
The cornerstone of deep learning is artificial neural networks. Whenever you have an artificial neural network with many hidden layers, it is most likely a deep learning problem being solved.


Error
In a supervised machine learning setup, you always have an actual value that supervises the model and a predicted (estimated) value; the difference between the two is the error. In such supervised problems, you try to minimise this error.
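
A common way to aggregate these errors is the mean squared error (MSE); a toy computation with invented actual and predicted values:

```python
actual    = [3.0, 5.0, 7.0]
predicted = [2.5, 5.0, 8.0]

# Per-point error: actual minus predicted.
errors = [a - p for a, p in zip(actual, predicted)]

# MSE: average of the squared errors (squaring keeps signs from cancelling).
mse = sum(e ** 2 for e in errors) / len(errors)
print(round(mse, 4))  # 0.4167
```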


Feature
Feature, variable, or input: all of these mean what you feed into your data science models.

 

Gradient Descent
Gradient descent is an iterative method for minimising a function: it repeatedly steps in the direction opposite to the gradient, settling into a local minimum that, with luck and good tuning, approximates the global minimum.
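
A minimal sketch, minimising the toy function f(x) = (x - 3)^2, whose gradient is 2(x - 3) (the learning rate and step count are chosen arbitrarily):

```python
def gradient_descent(grad, x0, lr=0.1, steps=200):
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)   # step opposite to the gradient
    return x

# Minimise f(x) = (x - 3)^2; its gradient is 2*(x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(round(x_min, 4))  # 3.0
```

For a convex function like this one the local minimum is the global minimum; for non-convex functions (e.g. neural network losses) the result depends on the starting point.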


Hypothesis Testing
Using a test statistic, you either reject the null hypothesis in favour of the alternative or fail to reject the null.


Intercept
The intercept in linear regression is the parameter that has the same impact for all inputs. For example, if your intercept is 3, its contribution to y is always 3, irrespective of x.
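
A toy illustration, assuming an intercept of 3 and a slope of 2:

```python
def predict(x, intercept=3.0, slope=2.0):
    # The intercept contributes the same amount for every x.
    return intercept + slope * x

print(predict(0))  # 3.0: with x = 0, only the intercept remains
print(predict(5))  # 13.0
```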


Julia
Julia is often touted as the next breakthrough in data science programming, as it is specially designed for quantitative researchers and data scientists.


KNN
K-Nearest Neighbours (KNN) is a method that can be adapted to both regression and classification. The key idea is to find the K nearest points to any data point and use them to make a prediction. Most importantly, it is a non-parametric model, and K is the hyper-parameter.
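
A minimal pure-Python sketch of KNN classification (the training points and the choice of k are invented for the example):

```python
from collections import Counter

def knn_classify(train, query, k=3):
    # train: list of ((features...), label) pairs. KNN is non-parametric:
    # the "model" is the training data itself, and k is the hyper-parameter.
    by_distance = sorted(
        train,
        key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], query)))
    labels = [label for _, label in by_distance[:k]]
    # Predict the majority label among the k nearest neighbours.
    return Counter(labels).most_common(1)[0][0]

train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]
print(knn_classify(train, (2, 2)))  # A
print(knn_classify(train, (8, 7)))  # B
```

For regression, the final step would average the neighbours' target values instead of taking a majority vote.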


Linear Regression
Linear Regression is the simplest, yet one of the most powerful, predictive ML models. Its components are the intercept and the coefficients of all the features or predictors.
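
For simple (one-feature) linear regression, the intercept and coefficient have closed-form least-squares solutions; a toy sketch with data generated from y = 1 + 2x:

```python
def fit_simple_ols(xs, ys):
    # Ordinary least squares for y = intercept + slope * x.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return intercept, slope

# Data generated from y = 1 + 2x, so OLS should recover those parameters.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
intercept, slope = fit_simple_ols(xs, ys)
print(intercept, slope)  # 1.0 2.0
```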


Model
Model is an abstraction or simplification of reality using mathematics and statistics.


Normal Distribution
People new to data science generally talk about the normal distribution, but it is not the only statistical distribution of importance; there are others as well. A normal distribution is completely identified by its mean and variance. A useful heuristic for spotting one: it is bell-shaped and symmetric about its mean.


Overfitting
Overfitting tends to happen when the model is too complex: it fits the training data well but fails miserably when presented with new data.


P- Value
P-value measures the strength of rejection or failure to rejection. In statistical terms, it is the observed level of significance.


Q-Q Plot
A Q-Q plot compares two probability distributions. Comparing a sample against the normal distribution makes it a visual test for normality.


Random Forest
Random forest combines a number of decision trees, similar to bagging, with a twist: a random subset of features is considered while growing each tree.


SVD
SVD, along with PCA, is a dimension-reduction technique: if you feed it p features, where p is very large, it gives you a smaller set of transformed features. In PCA terms, these are the most important principal components, which help you reduce the dimensionality of the problem.


T Test
In statistical terms, a t-test is a test that compares the means of two different distributions.
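
A sketch of the pooled two-sample t statistic, assuming equal variances (the sample values are invented for the example):

```python
import math

def two_sample_t(a, b):
    # Pooled two-sample t statistic (assumes equal variances).
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)   # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    sp2 = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)  # pooled variance
    return (ma - mb) / math.sqrt(sp2 * (1 / na + 1 / nb))

t = two_sample_t([5.1, 4.9, 5.0, 5.2], [6.0, 6.1, 5.9, 6.2])
print(round(t, 2))  # a large |t| suggests the means really differ
```

The statistic would then be compared against the t distribution with the appropriate degrees of freedom to get a p-value.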


Underfitting
Underfitting is the opposite of overfitting. It is a deficient model that has missed important patterns or features during training.


Variance
Variance measures how spread out the data is about the mean. So, a data set like 2, 2, 2, 2, … has the lowest possible variance, zero, whereas a data set like 2, 5, 10, 15, 30 has a much higher variance.
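
The two data sets from the text can be checked directly (using the population variance, i.e. dividing by n):

```python
def variance(data):
    # Population variance: mean of squared deviations from the mean.
    m = sum(data) / len(data)
    return sum((x - m) ** 2 for x in data) / len(data)

print(variance([2, 2, 2, 2]))         # 0.0
print(variance([2, 5, 10, 15, 30]))   # 97.04
```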


Web Scraping
Web scraping is a risky way to get data from websites, so read up on all possible violations (such as a site's terms of service) before scraping.


X (Inputs)
X is what you know.


Y (Outputs)
Y is what you need to know.


Z-Score
The z-score measures how far a data point is from the mean in terms of standard deviations: subtract the mean from the data point and divide by the standard deviation.
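
A one-line version (the numbers below are invented for the example):

```python
def z_score(x, mean, std):
    # How many standard deviations x lies from the mean.
    return (x - mean) / std

print(z_score(85, mean=70, std=10))  # 1.5
```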


Wednesday, February 8, 2023

3 Tips for Finding your Data Science Project Idea

Projects are an essential part of learning data science, and deciding on a topic for a project is a tough nut to crack. Here I share my experience and a three-step formula for finding a personal project idea.

If you have decided to work on a project, think like a doctor. What does that mean? 

Think about what are the things that can improve your day-to-day life with the help of data. You'll come across a lot of problems. And one such problem is your data science project.

Think of a problem you might be facing that can somehow be connected with data. Now ask the question, can you solve it with data?

If your answer is no, you can learn and revisit more topics in Data Science.

If your answer is yes, it means you have hit the right target and are thinking in a suitable direction.
If you are still not sure how you will solve the problem, i.e. what data-driven approach should be followed, you can apply the following 3-step technique after figuring out the problem statement:


1) Approach: You have to figure out what kind of method you should prefer to solve the problem. Does it require machine learning, mathematical programming, mathematical analysis, or something more advanced?

2) Data: Next step is to find the relevant data according to your problem. Is it available on the internet, or do you need to scrape it? Is the data structured or unstructured, and how do you clean and preprocess the data? These are some questions that you need to ask yourself to get quality data for your problem statement.

3) Result: Once you have the data and have applied the chosen approach, it's time to present the solution to a general audience. Writing a detailed report of your findings is the best way to present your project. The report also helps key stakeholders understand your project without going through each line of code.


This 3-step technique does wonders not only for personal projects but also for professional ones.

Let's understand with an example how to apply this 3-step approach to a given problem statement.

Problem: How to increase subscribers of a YouTube channel?
(Disclaimer: The YouTube algorithm is far more advanced than the solution presented here; the solution is only meant to illustrate the strategy.)

1. Approach: This problem may require machine learning techniques like Regression, Random forest, or advanced techniques like Neural Networks.

2. Data: There are many ways to get the data of a YouTube channel. We can use the YouTube API, or we can ask the owner of the channel.

3. Result: The way we present the result is the most important thing. Saying something like "make quality content for a YouTube channel" might not be an effective answer to the given problem statement.


The result should be specific, actionable, and personalised. Something like, "Posting twice a week, replying to all comments, and uploading videos in the evening are the key insights from the data and analysis that have increased engagement and retention in the past, so following these tips will help in gaining subscribers," will be very effective.


So, the overall summary for finding a suitable data science project is:
Look for problems → ask whether they can be solved with the help of data → solve them using the 3-step technique (Approach → Data → Result).


Sunday, January 15, 2023

Which One is Better for Data Science : IIT or ISI | Indian Statistical Institute Vs Indian Institute of Technology

Data Science, which has been termed the sexiest job of the 21st century, has gained a lot of traction and eyeballs in the last few years. More and more people are trying to enter this field, and to cope with the demand, various institutes have started offering programs in Data Science, Machine Learning, and Artificial Intelligence. Two such premier institutes are IIT and ISI, both known for their elite pedigree. But which one should you prefer over the other? There are multiple deciding factors, such as the location of the institution, type of crowd, and exposure. Here we focus on the curriculum and subjects, as these are the most important deciding factor.

Roughly speaking, we can divide the whole Data Science & Machine Learning work into two parts: Tech-focused and Statistics centric.



Tech-focused: This part of Data Science involves dealing with large data sets, finding patterns in data using Machine Learning and AI models, and deploying the models into production. Coding or programming, together with an understanding of technology, is the key skill.

Examples include Advanced Machine Learning, Deep Learning techniques such as Neural Networks, etc.

Statistics Centric: This part of Data Science involves more explainable models, where most things depend on the parameters of the models. It involves estimating the parameters that fully describe the model. Statistics and Maths are the key areas.

Examples include linear regression, statistical inference, time series forecasting, etc.

Disclaimer: Both approaches are used, individually or together, for a given problem across different industries and use cases.

Do you enjoy engineering more, or do you prefer theoretical study? That is the one question that should help you decide above anything else.

IITs are known to focus on engineering & technology whereas ISI has primarily a theme of Statistics.

If working in technology and coding appeals to you, then you can prefer IITs for Data Science. IITs are the best place to learn, work and implement technology with the brightest minds. 

If you have an interest in numbers and love mathematics and its application in the real world, then ISI is for you. ISI is the best place to understand the Statistics and Maths behind Machine Learning. You get a chance to understand explainable ML and the analysis side of Data Science, and how it can help the business.

ISI primarily focuses on traditional statistics & maths. Coding, deployment, and application are the added layers. The situation is the other way round in IITs.

It's not that IITians never learn mathematics or statistics, or that ISI students never code. It's the curriculum, environment, and culture that differentiate them.

Data Science is always a combination of Statistics, Programming, and Maths. Different institutes have different routes for this journey, giving priority to one section over another. At last, it boils down to your choice & preference.

Still, if you are confused, the good news is that you won't go wrong with either institute, as both IITs and ISI are wonderful places to become a data scientist.



Saturday, March 12, 2022

AI Vs ML Vs Deep Learning Vs Data Science


So, what really is the difference amongst AI, ML, DL & Data Science?

This is fundamentally a marketing question. A better way to approach it is to define a spectrum of data science approaches.

At one end you will have Statistics; the other end is occupied by Software Engineering. See below.


Let me offer an explanation of the above diagram. Towards the left, you will see a lot of core data approaches: working with spreadsheets, data cleaning and wrangling, plotting histograms and fitting distributions, etc. This is the area where you need Maths/Stats skills such as hypothesis testing and measures of dispersion. This I call 'Statistical Data Science'.
As you look towards the right, you will find a lot of new-age, software-driven, data-hungry methods, which are oriented around validation metrics.

So, it's the same kind of solution with varying degrees of software-maths-statistics usage. Towards the left, the proportion of Maths/Statistics is higher; the right is tilted towards Software Engineering.

Let's look at the same thing with a Venn diagram:

Also, there are things like Robotics and AGI which get mixed up with Data Science. It is important to understand that Data Science is largely a statistical discipline, whereas Robotics draws more on Electronics and Mechanical Engineering.

Also, I have talked about these things in a YouTube video. Check it out below: