Frequently Asked Questions About Edureka AI/ML Foundations Skill
22 answers covering everything from basics to advanced usage.
// Basics
What is Artificial Narrow Intelligence and why does it matter?
Artificial Narrow Intelligence (ANI), also called Weak AI, is the only stage of AI that currently exists in production. ANI systems perform a single narrowly defined task—like voice recognition (Siri, Alexa) or playing chess (Deep Blue). Every AI system you build today is ANI. It matters because over-claiming AGI or ASI capabilities is technically incorrect and sets false expectations with stakeholders, regulators, and users.
What are the four functional types of AI?
The four types are: Reactive Machines (no memory, operate only on present data, e.g., Deep Blue), Limited Memory AI (use recent historical data for decisions, e.g., self-driving cars), Theory of Mind AI (comprehend emotions and beliefs—still in research), and Self-Aware AI (possess consciousness—purely hypothetical). Current production systems are either Reactive Machines or Limited Memory. Identifying your system's type sets correct expectations for what it can and cannot do.
What is the Turing Test and is it still relevant?
The Turing Test, proposed by Alan Turing in 1950, evaluates whether a machine can exhibit intelligent behavior indistinguishable from a human through text-based conversation. It remains a foundational concept in AI philosophy but is not used as a practical engineering benchmark. Modern AI evaluation relies on task-specific metrics like accuracy, F1-score, and BLEU scores rather than the Turing Test.
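As a minimal sketch of the task-specific metrics mentioned above, the following computes accuracy and F1-score by hand for a binary classifier. The label lists are made-up illustration data, and the helper name `binary_metrics` is hypothetical rather than part of any library.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels for illustration only.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In practice you would normally call a library implementation; the hand-written version is only meant to show what the numbers measure.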
What is feature engineering and when should I do it manually?
Feature engineering is the process of using domain knowledge to select, create, and transform input variables to improve model performance. In classical ML, feature engineering is largely manual and often critical, because the quality of the features strongly influences model quality. In deep learning, the network learns most of its features from raw data, so extensive manual feature engineering is usually less important, although preprocessing steps such as scaling, normalization, and tokenization are still needed. Clarify which paradigm you're using before deciding how much time to invest in feature engineering.
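A small illustration of manual feature engineering for a classical model, assuming a hypothetical housing table. The column names and values are invented for this sketch; the point is only that domain knowledge (a house's age, its space per room) is turned into new input columns.

```python
import pandas as pd

# Hypothetical raw inputs; all names and values are illustrative.
houses = pd.DataFrame({
    "sqft": [1500, 2000, 1000],
    "rooms": [3, 5, 2],
    "year_built": [1990, 2010, 1975],
    "sale_year": [2020, 2021, 2020],
})

# Derived features based on domain knowledge.
houses["age_at_sale"] = houses["sale_year"] - houses["year_built"]
houses["sqft_per_room"] = houses["sqft"] / houses["rooms"]
houses["is_new_build"] = (houses["age_at_sale"] <= 15).astype(int)

print(houses[["age_at_sale", "sqft_per_room", "is_new_build"]])
```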
What is data splicing in machine learning?
Data splicing, more commonly called data splitting or a train-test split, is the process of dividing your dataset into a training set (used to build the model) and a testing set (used only for evaluation). The training set is normally the larger portion, commonly 70-80% of the data, with the remaining 20-30% held out for testing. This separation ensures the model is evaluated on data it has never seen, giving an honest estimate of real-world performance. Never train on test data; doing so produces falsely optimistic accuracy.
What is the Apriori algorithm and when should I use it?
The Apriori algorithm is an unsupervised learning method used for association analysis—discovering which items frequently co-occur. Its most common application is market basket analysis (e.g., customers who buy bread also buy butter). Use it when you have transactional data and want to find item co-occurrence patterns to inform cross-selling, product placement, or recommendation strategies. It falls under unsupervised learning because there is no target variable to predict.
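The core idea can be sketched in plain Python: count how often itemsets occur, keep those above a minimum support, and only build larger candidates from itemsets that are already frequent. This is a simplified teaching sketch, not a tuned implementation; the function name `frequent_itemsets` and the basket data are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item in the itemset.
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}

    result = {}
    k = 1
    while current:
        for s in current:
            result[s] = support(s)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return result


# Made-up market baskets for illustration.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]
result = frequent_itemsets(baskets, min_support=0.6)
for itemset, sup in sorted(result.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), round(sup, 2))
```

With these baskets, bread and butter appear together in three of five transactions, so the pair clears the 0.6 support threshold while milk and jam do not.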
// How To
How do I decide between regression and classification?
Check your target variable. If it is a continuous quantity (e.g., predicting house prices, temperature, revenue), you have a regression problem. If it is a categorical label (e.g., spam/not spam, delayed/on time, disease/no disease), you have a classification problem. Both fall under supervised learning because they require labeled training data. This distinction determines which algorithms are candidates.
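One way to make this check programmatic is Scikit-Learn's `type_of_target` utility, which inspects a target array and reports whether it looks continuous, binary, or multiclass. The wrapper `suggest_task` below is a hypothetical helper written for this sketch, and the sample targets are made up.

```python
from sklearn.utils.multiclass import type_of_target


def suggest_task(y):
    """Map Scikit-Learn's target type onto 'regression' or 'classification'."""
    kind = type_of_target(y)
    if kind.startswith("continuous"):
        return "regression"
    if kind in ("binary", "multiclass"):
        return "classification"
    return f"unclear ({kind})"


house_prices = [250000.5, 310000.25, 199999.9]   # continuous quantity
spam_labels = ["spam", "ham", "spam"]            # categorical label

print(suggest_task(house_prices))
print(suggest_task(spam_labels))
```

Treat the result as a hint only: integer-valued numeric targets such as counts are reported as class labels by this utility, so the final call still depends on what the target means.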
How do I handle missing values during data preparation?
Missing values must be addressed before model training because many algorithms cannot handle them and they can silently distort computed statistics. Common strategies include: dropping rows or columns with excessive missingness, imputing with mean/median for numerical features, imputing with mode for categorical features, or using more advanced methods like KNN imputation. Use Pandas' isnull() to find missing values and fillna() to impute them. Always document how many values were missing and what strategy you applied, as this affects model validity.
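A minimal sketch of the count-then-impute workflow with Pandas, using a tiny made-up table. Median imputation for the numeric column and mode imputation for the categorical column are just two of the strategies listed above, chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value per column.
df = pd.DataFrame({
    "age": [25, np.nan, 40, 35],
    "city": ["Pune", "Delhi", None, "Delhi"],
})

# Record how many values were missing before changing anything.
missing_counts = df.isnull().sum()
print(missing_counts)

# Numerical feature: impute with the median of the observed values.
df["age"] = df["age"].fillna(df["age"].median())

# Categorical feature: impute with the most frequent value (mode).
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)
```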
How do I split data into training and testing sets correctly?
Use Scikit-Learn's train_test_split function. The training set should be 70-80% of your data and is used to build the model. The testing set (20-30%) is used only for evaluation. Never train on test data—this produces falsely optimistic accuracy. For time-series data, split chronologically rather than randomly. For classification with imbalanced classes, use stratified splitting to maintain class proportions in both sets.
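A short sketch of both cases: a stratified 80/20 split for an imbalanced classification target, and a chronological split for time-ordered rows. The data is synthetic and the `random_state` value is arbitrary.

```python
from sklearn.model_selection import train_test_split

# Synthetic imbalanced dataset: 80 negatives, 20 positives.
X = [[i] for i in range(100)]
y = [0] * 80 + [1] * 20

# stratify=y keeps the 80/20 class ratio in both the training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(len(X_train), len(X_test), sum(y_test))

# Chronological split for time-ordered rows: no shuffling,
# the most recent 20% of observations form the test set.
series = list(range(100))  # stand-in for rows sorted by timestamp
cutoff = int(len(series) * 0.8)
ts_train, ts_test = series[:cutoff], series[cutoff:]
```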
How do I perform Exploratory Data Analysis before building a model?
EDA is the brainstorming stage. Start by examining feature distributions with histograms and box plots. Calculate correlation matrices to identify strong predictors of your target variable. Check for class imbalance in classification problems. Use scatter plots and pair plots to visualize relationships between features. Flag outliers. EDA insights directly inform model design—skipping it means building blind, which degrades model quality.
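The numeric side of these checks can be sketched with Pandas alone; plots are omitted here to keep the example self-contained. The table, its column names, and the 1.5 x IQR outlier rule are assumptions made for this illustration.

```python
import pandas as pd

# Made-up dataset: two features, a numeric target, and a binary label.
df = pd.DataFrame({
    "sqft": [800, 1000, 1200, 1500, 2000, 1100],
    "age_years": [30, 12, 25, 8, 3, 40],
    "price_k": [100, 130, 150, 190, 260, 135],
    "sold_fast": [0, 0, 1, 1, 1, 0],
})

# Distribution summary for every numeric column.
summary = df.describe()

# Correlation of each column with the target, strongest first.
target_corr = df.corr()["price_k"].drop("price_k").sort_values(ascending=False)

# Class balance for the classification label.
class_balance = df["sold_fast"].value_counts(normalize=True)

# Flag outliers in the target with the common 1.5 * IQR rule.
q1 = df["price_k"].quantile(0.25)
q3 = df["price_k"].quantile(0.75)
iqr = q3 - q1
outliers = df[(df["price_k"] < q1 - 1.5 * iqr) | (df["price_k"] > q3 + 1.5 * iqr)]

print(summary, target_corr, class_balance, outliers, sep="\n\n")
```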
// Troubleshooting
My model accuracy is high on training data but low on test data—what's wrong?
This is overfitting: your model memorized the training data instead of learning generalizable patterns. Common fixes include reducing model complexity (fewer features or a simpler algorithm), applying regularization (L1/L2 penalties), increasing training data volume, and applying early stopping for neural networks. Use cross-validation instead of a single train-test split to get a more reliable estimate of how well the model generalizes. Always compare training and testing accuracy to detect this issue early.
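A sketch of the train-versus-test comparison, assuming a synthetic noisy dataset from Scikit-Learn's `make_classification`. It fits an unrestricted Decision Tree and a depth-limited one and reports both accuracies for each; the dataset parameters and the helper name `train_test_accuracy` are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% of labels flipped to simulate noise.
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)


def train_test_accuracy(model):
    """Fit the model and return (training accuracy, testing accuracy)."""
    model.fit(X_train, y_train)
    return model.score(X_train, y_train), model.score(X_test, y_test)


deep = train_test_accuracy(DecisionTreeClassifier(random_state=0))
shallow = train_test_accuracy(DecisionTreeClassifier(max_depth=3, random_state=0))

print(f"unrestricted tree: train={deep[0]:.2f} test={deep[1]:.2f}")
print(f"max_depth=3 tree:  train={shallow[0]:.2f} test={shallow[1]:.2f}")
```

A large gap between the two numbers for a model is the warning sign described above; limiting depth is one form of reducing model complexity.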
Why is my deep learning model performing worse than a simple Decision Tree?
Deep learning generally requires large volumes of data to outperform classical ML. On small datasets, deep learning models have too many parameters relative to the data available, leading to overfitting and poor generalization. Classical algorithms like Decision Trees or Random Forest often win on small data. Switch to deep learning only when you have substantially more data (as a rough guide, tens of thousands of samples or more) and GPU hardware to train efficiently.
My model takes too long to train—what should I do?
Long training times are expected for deep learning models—large neural networks can take weeks from scratch. First confirm you have GPU access (deep learning on CPU is impractically slow). Reduce input dimensionality through feature selection. Use pre-trained models and transfer learning instead of training from scratch. For classical ML, ensure you're not using unnecessarily complex algorithms—try simpler ones like Logistic Regression or Naive Bayes first.
How do I know if my dataset is too small for machine learning?
There is no universal minimum, but guidelines help. For classical ML, you generally need at least 10 times as many observations as features for regression, and enough samples per class for classification (a common heuristic is 50-100 minimum per class). For deep learning, you typically need tens of thousands of samples. If your dataset is small, prefer simple algorithms (Logistic Regression, Naive Bayes), apply cross-validation, and consider data augmentation or transfer learning.
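The two rules of thumb above can be turned into a quick sanity check. The function below is a hypothetical helper, and its default thresholds simply encode the heuristics from this answer (10 observations per feature, 50 samples per class); they are not hard limits.

```python
def dataset_size_warnings(n_rows, n_features, class_counts=None,
                          rows_per_feature=10, min_per_class=50):
    """Return a list of warnings based on common rule-of-thumb thresholds."""
    warnings = []
    needed = rows_per_feature * n_features
    if n_rows < needed:
        warnings.append(
            f"{n_rows} rows for {n_features} features; heuristic suggests >= {needed}"
        )
    # class_counts maps each class label to its number of samples.
    for label, count in (class_counts or {}).items():
        if count < min_per_class:
            warnings.append(
                f"class {label!r} has {count} samples; heuristic suggests >= {min_per_class}"
            )
    return warnings


# Made-up dataset shapes for illustration.
print(dataset_size_warnings(120, 20))
print(dataset_size_warnings(1000, 20, {"fraud": 30, "ok": 970}))
print(dataset_size_warnings(1000, 20, {"fraud": 500, "ok": 500}))
```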
// Comparisons
How does supervised learning compare to reinforcement learning?
Supervised learning trains on a static dataset with labeled input-output pairs, so the correct answer is provided for each example. Reinforcement learning typically has no predefined labeled dataset; instead, an agent interacts with an environment, takes actions, and learns from reward signals through trial and error. Use supervised learning when you have labeled historical data. Use reinforcement learning for sequential decision-making problems like game-playing, robotics, or autonomous navigation.
How does K-Means clustering compare to classification algorithms?
K-Means is an unsupervised algorithm that groups data points into K clusters based on feature similarity without any predefined labels. Classification algorithms (Logistic Regression, SVM, KNN) are supervised and require labeled data to predict predefined categories. Use K-Means when you don't know the categories in advance and want the algorithm to discover natural groupings. Use classification when you have labeled examples and want to assign new data to known categories.
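A minimal K-Means sketch with Scikit-Learn on six made-up 2D points that form two obvious groups. No labels are supplied; the algorithm assigns cluster ids on its own, and which group receives id 0 or 1 is arbitrary.

```python
from sklearn.cluster import KMeans

# Made-up points: three near (1, 1) and three near (8, 8). No labels are given.
points = [
    [1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
    [8.0, 8.0], [8.2, 7.9], [7.9, 8.1],
]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

print(labels)                   # cluster id per point (id numbering is arbitrary)
print(kmeans.cluster_centers_)  # learned centroids

# Assign a new, unseen point to the nearest learned cluster.
new_label = kmeans.predict([[7.5, 8.4]])[0]
```

A supervised classifier would instead need a label for each of the six points before it could be trained.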
How does Scikit-Learn compare to TensorFlow for machine learning?
Scikit-Learn is the standard library for classical machine learning: regression, classification, clustering, and preprocessing. It runs on CPU and is fast for small to medium datasets. TensorFlow is designed for deep learning, that is, building and training neural networks, and it can use GPUs to accelerate training. Use Scikit-Learn for classical ML algorithms and rapid prototyping. Use TensorFlow when your problem requires deep learning (CNNs for images, RNNs for sequences) and you have large data and GPU infrastructure.
// Advanced
Can I use this framework for natural language processing tasks?
Yes. NLP tasks map onto the same framework: sentiment analysis is supervised classification, topic modeling is unsupervised clustering, text generation can involve deep learning. Use NLTK for text preprocessing and feature extraction. For classical NLP, apply Naive Bayes or SVM on TF-IDF features. For complex tasks like translation or summarization, deep learning with TensorFlow/Keras is appropriate. The seven-step process applies identically—data preparation for text is especially critical.
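The classical route (Naive Bayes on TF-IDF features) can be sketched with a Scikit-Learn pipeline. The four training sentences and their labels are invented for illustration and are far too few for a real model; Scikit-Learn's `TfidfVectorizer` is used here for feature extraction as an assumption of this sketch.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up sentiment dataset.
train_texts = [
    "loved this movie great acting",
    "great film wonderful story",
    "terrible movie boring plot",
    "awful film waste of time",
]
train_labels = ["pos", "pos", "neg", "neg"]

# TF-IDF turns each text into a weighted word-count vector;
# Multinomial Naive Bayes then classifies those vectors.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

preds = model.predict(["great wonderful acting", "boring awful plot"])
print(list(preds))
```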
How do I handle interpretability requirements in regulated industries?
In regulated or high-stakes industries (healthcare, finance, criminal justice), model interpretability is often legally required. Prefer Decision Trees or Logistic Regression: a decision tree yields inspectable if-then rules, and logistic regression yields coefficients that show how each feature pushes the prediction, so you can explain why a decision was made. Avoid deep learning black-box models in these contexts even if they offer higher accuracy. If you must use complex models, apply techniques like SHAP or LIME for post-hoc explanations, and document the interpretability limitation for stakeholders.
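To illustrate what inspectable rules look like, the sketch below trains a small Decision Tree and prints its rules with Scikit-Learn's `export_text`. The loan-style data and the feature names `income_k` and `age` are hypothetical.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up applicants: [income in thousands, age]; label 1 = approved.
X = [[20, 30], [25, 45], [30, 50], [60, 35], [70, 40], [80, 55]]
y = [0, 0, 0, 1, 1, 1]

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Human-readable if-then rules that can be shown to an auditor.
rules = export_text(tree, feature_names=["income_k", "age"])
print(rules)

decision = tree.predict([[65, 30]])[0]
```

Each prediction can be traced to a path through the printed rules, which is the kind of explanation a black-box model cannot give directly.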
What ethical considerations should I address before deploying an ML model?
Before deployment, audit your training data for bias—models trained on incomplete or unrepresentative datasets reproduce and amplify those biases. Document data provenance, missing categories, and known limitations. In cybersecurity AI, incomplete training data can cause false positive alerts and alert fatigue. In healthcare, biased models can lead to disparate treatment outcomes. Flag regulatory compliance requirements (GDPR, HIPAA). Always disclose that your system is ANI with specific, bounded capabilities.
Can I apply this framework to computer vision problems?
Yes. Computer vision tasks map cleanly: image classification is supervised classification (use CNNs), object detection is an end-to-end deep learning task (use YOLO), and image segmentation can be classification at the pixel level. For computer vision, deep learning is almost always the right choice because the data volume is typically large and the algorithm must learn hierarchical visual features automatically. Confirm GPU availability and plan for extended training times.
What is Q-Learning and how does it relate to AlphaGo?
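Q-Learning is a foundational reinforcement learning algorithm in which an agent learns an optimal action policy by maximizing cumulative reward through trial and error. In its basic form it maintains a Q-table mapping state-action pairs to expected rewards. Deep Q-Networks (DQN) replace the Q-table with a neural network so the method can handle state spaces too large to enumerate. AlphaGo builds on the same reinforcement learning principles but is not a Q-Learning system: it combines deep policy and value networks with Monte Carlo tree search, trained on human games and through self-play, to handle the enormous state space of the board game Go. Q-Learning is appropriate for sequential decision-making problems where the agent learns from environmental interaction, not from a static dataset.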
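A minimal tabular Q-Learning sketch on a toy environment invented for this example: five states in a row, where the agent starts at the left end and receives a reward of 1 for reaching the right end. The environment, the function names, and the hyperparameters (learning rate, discount, exploration rate) are all illustrative assumptions.

```python
import random

N_STATES = 5      # states 0..4; state 4 is the goal
ACTIONS = [0, 1]  # 0 = move left, 1 = move right


def step(state, action):
    """Toy environment: return (next_state, reward, done)."""
    if action == 0:
        nxt = max(0, state - 1)
    else:
        nxt = min(N_STATES - 1, state + 1)
    done = nxt == N_STATES - 1
    reward = 1.0 if done else 0.0
    return nxt, reward, done


def train(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    # Q-table: one row per state, one column per action.
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)  # explore
            else:
                best = max(q[state])          # exploit, breaking ties randomly
                action = rng.choice([a for a in ACTIONS if q[state][a] == best])
            nxt, reward, done = step(state, action)
            # Q-Learning update: move the estimate toward
            # reward + discounted best value of the next state.
            target = reward + (0.0 if done else gamma * max(q[nxt]))
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
    return q


q_table = train()
policy = ["right" if row[1] > row[0] else "left" for row in q_table[:-1]]
print(policy)
```

The agent is never told that moving right is correct; the preference emerges from the reward signal alone, which is the contrast with learning from a static labeled dataset.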