🏠 Welcome to Data Adda! In Part 2 of the House Price Prediction Project Series, we dive into one of the most important stages of any Data Science project: Exploratory Data Analysis (EDA) and Data Understanding.

Before building Machine Learning models, every Data Scientist must understand the dataset, identify patterns, detect anomalies, and uncover relationships between features.

📚 Topics Covered:
✅ Loading and Understanding the Dataset
✅ Dataset Overview
✅ Feature Identification
✅ Target Variable Analysis
✅ Exploratory Data Analysis (EDA)
✅ Descriptive Statistics
✅ Missing Value Analysis
✅ Outlier Detection
✅ Distribution Analysis
✅ Correlation Analysis
✅ Feature Relationships
✅ Data Visualization using Matplotlib & Seaborn

🏠 Features Analyzed:
• Location
• Area (Square Feet)
• Bedrooms
• Bathrooms
• Parking
• Property Age
• Price

📊 Visualizations Covered:
✔ Histogram
✔ Box Plot
✔ Scatter Plot
✔ Correlation Heatmap
✔ Pair Plot
✔ Distribution Plot

🎯 What You Will Learn:
• How to understand a dataset like a Data Scientist
• How to identify important features
• How to detect data quality issues
• How to find relationships between variables
• How to prepare data for Machine Learning

🔥 Why Is EDA Important?
Many machine learning failures trace back to a poor understanding of the data. EDA helps us discover hidden patterns before building models.

This is Part 2 of the Complete House Price Prediction Series.

📌 Upcoming Videos:
Part 3 → Data Cleaning & Feature Engineering
Part 4 → Model Building
Part 5 → Model Evaluation
Part 6 → Model Deployment

Perfect for:
✅ Data Science Beginners
✅ Machine Learning Students
✅ Analytics Professionals
✅ Interview Preparation
✅ Portfolio Projects

Subscribe to Data Adda for practical Data Science, Machine Learning, AI, Statistics, and GenAI tutorials.

#HousePricePrediction #EDA #DataScienceProject #MachineLearningProject #DataAdda #ExploratoryDataAnalysis #Python #DataScience
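| PropertyType | Area | Bedrooms | Bathrooms | Age | Parking | LocationScore | HouseID | OwnerPhone | RandomCode | Price |
|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|
| Normal House | 1000 | 2 | 1 | 10 | 1 | 5 | 101 | 9871 | 45 | 50 |
| Normal House | 1200 | 2 | 2 | 8 | 1 | 6 | 102 | 9872 | 88 | 55 |
| Normal House | 1500 | 3 | 2 | 5 | 1 | 7 | 103 | 9873 | 12 | 70 |
| Normal House | 1800 | 3 | 3 |  | 1 | 8 | 104 | 9874 | 67 | 80 |
| Normal House | 2000 | 4 | 3 | 3 | 2 | 9 | 105 | 9875 | 90 | 90 |
| Luxury Villa | 10000 | 10 | 8 | 1 | 8 | 10 | 106 | 9876 | 500 | 1500 |
| Normal House |  | 5 | 4 | 2 | 2 | 9 | 107 | 9877 | 11 | 110 |
| Normal House | 2500 | 5 | 4 | 2 | 2 | 9 | 107 | 9877 | 11 | 110 |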
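The sample rows above can be rebuilt as a DataFrame for a quick data-quality check. This is only a sketch: the position of the two blank cells (Age for HouseID 104, Area for the first HouseID 107 row) is an assumption inferred from the row lengths, and the name `sample_df` is hypothetical.

```python
import numpy as np
import pandas as pd

columns = ["PropertyType", "Area", "Bedrooms", "Bathrooms", "Age", "Parking",
           "LocationScore", "HouseID", "OwnerPhone", "RandomCode", "Price"]

# np.nan marks the cells assumed to be missing in the sample rows
rows = [
    ["Normal House", 1000, 2, 1, 10, 1, 5, 101, 9871, 45, 50],
    ["Normal House", 1200, 2, 2, 8, 1, 6, 102, 9872, 88, 55],
    ["Normal House", 1500, 3, 2, 5, 1, 7, 103, 9873, 12, 70],
    ["Normal House", 1800, 3, 3, np.nan, 1, 8, 104, 9874, 67, 80],
    ["Normal House", 2000, 4, 3, 3, 2, 9, 105, 9875, 90, 90],
    ["Luxury Villa", 10000, 10, 8, 1, 8, 10, 106, 9876, 500, 1500],
    ["Normal House", np.nan, 5, 4, 2, 2, 9, 107, 9877, 11, 110],
    ["Normal House", 2500, 5, 4, 2, 2, 9, 107, 9877, 11, 110],
]
sample_df = pd.DataFrame(rows, columns=columns)

# Count missing values per column and keep only the columns that have any
missing_counts = sample_df.isnull().sum()
print(missing_counts[missing_counts > 0])

# List rows that share a HouseID, a hint that a record was entered twice
repeated_ids = sample_df[sample_df["HouseID"].duplicated(keep=False)]
print(repeated_ids[["HouseID", "Area", "Price"]])
```

Checking for a repeated identifier is a useful complement to `df.duplicated()`, which only catches rows that match in every column.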
# Professional House Price Prediction Training Notebook

Based on the uploaded 22-page document. Includes all steps, explanations, and code workflow.

## STEP 1: Import Libraries

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.preprocessing import StandardScaler
    from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

## STEP 2: Load Dataset

    df = pd.read_csv("data/linear_regression_sample_data.csv")

## Business Understanding

The price of 1500 is an outlier, but it represents a real Luxury Villa business case and should not be blindly removed.

## STEP 3: Understand Dataset

    df.info()
    df.describe()

## STEP 4: Handle Missing Values

    # Count missing values per column
    df.isnull().sum()

    # Fill missing Area and Age values with the column mean
    df['Area'] = df['Area'].fillna(df['Area'].mean())
    df['Age'] = df['Age'].fillna(df['Age'].mean())
    df

## STEP 5: Handle Duplicate Rows

    # Count fully duplicated rows, then drop them
    df.duplicated().sum()
    df.drop_duplicates(inplace=True)
    df

## STEP 6: Handle Outliers

    sns.boxplot(x=df['Price'])
    plt.title('Price Outliers')
    plt.show()

    # IQR rule: values beyond 1.5 * IQR from the quartiles are flagged as outliers
    Q1 = df['Price'].quantile(0.25)
    Q3 = df['Price'].quantile(0.75)
    IQR = Q3 - Q1
    lower_limit = Q1 - 1.5 * IQR
    upper_limit = Q3 + 1.5 * IQR
    print(Q1, Q3, IQR, lower_limit, upper_limit)

    outliers = df[(df['Price'] < lower_limit) | (df['Price'] > upper_limit)]
    outliers

Industry Decision: Keep the Luxury Villa outlier because it is valid business data.
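As a minimal sketch of the IQR rule from STEP 6, the snippet below applies it to the Price values copied from the sample rows instead of the CSV, which is not shown here. The helper name `iqr_bounds` and its `k` parameter are hypothetical, introduced only for illustration.

```python
import pandas as pd

# Price values copied from the sample rows (the CSV itself is not available here)
prices = pd.Series([50, 55, 70, 80, 90, 1500, 110, 110], name="Price")


def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) limits of the IQR rule for a numeric series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


lower_limit, upper_limit = iqr_bounds(prices)

# Flag outliers instead of dropping them, so valid business cases stay in the data
is_outlier = (prices < lower_limit) | (prices > upper_limit)
flagged = prices[is_outlier].tolist()

print(lower_limit, upper_limit)
print(flagged)
```

Flagging rather than deleting matches the decision above: the villa stays in the dataset, and its influence can be handled later, for example with a transformation of the target.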
Log-transform Price to reduce the skew caused by the Luxury Villa:

    df['LogPrice'] = np.log(df['Price'])
    df[['Price', 'LogPrice']]

## STEP 7: EDA

    sns.histplot(df['Price'])
    plt.title('Price Distribution')
    plt.show()

    sns.histplot(df['LogPrice'])
    plt.title('Log Price Distribution')
    plt.show()

## STEP 8: Bivariate Analysis

    sns.scatterplot(x=df['Area'], y=df['Price'])
    plt.show()

    sns.scatterplot(x=df['Bedrooms'], y=df['Price'])
    plt.show()

    sns.scatterplot(x=df['Age'], y=df['Price'])
    plt.show()

## STEP 9: Analyze Unwanted Features

    sns.scatterplot(x=df['RandomCode'], y=df['Price'])
    plt.show()

Observation: No meaningful relationship.

## STEP 10: Correlation Analysis

    # Encode the two property types as 0/1 so they can enter the correlation matrix
    df['PropertyType'] = df['PropertyType'].map({'Normal House': 0, 'Luxury Villa': 1})
    correlation = df.corr(numeric_only=True)
    correlation

    sns.heatmap(correlation, annot=True, cmap='Blues')
    plt.show()

## STEP 11: Feature Selection

    X = df[['PropertyType', 'Area', 'Bedrooms', 'Bathrooms', 'Age', 'Parking', 'LocationScore']]
    y = df['LogPrice']
    print(X.isnull().sum())

## STEP 12: Train Test Split

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## STEP 13: Feature Scaling

    # Fit the scaler on the training rows only so test-set statistics do not leak into training
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

## STEP 14: Create ML Model

    model = LinearRegression()

## STEP 15: Train Model

    model.fit(X_train, y_train)

## STEP 16: Make Predictions

    # Predictions are on the log-price scale because y is LogPrice
    y_pred = model.predict(X_test)
    print(y_pred)

## STEP 17: Evaluate Model

    mae = mean_absolute_error(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)
    print(mae, rmse, r2)

# 📊 Understanding Regression Error Metrics

## Why Use Multiple Metrics?

Each metric measures a different aspect of model performance.

| Metric | Purpose |
|----------|----------|
| MAE | Average mistake |
| RMSE | Punishes large mistakes |
| R² | Overall model quality |

---

## 1. MAE (Mean Absolute Error)

Measures the average absolute prediction error.

### Interpretation

```text
MAE = 8
```

Means:

```text
On average, the model prediction is off by 8 units.
```

### Business Example

```text
MAE = ₹8 lakh
```

Means:

```text
House price predictions are wrong by ₹8 lakh on average.
```

---

## 2. RMSE (Root Mean Squared Error)

Measures prediction error while heavily penalizing large mistakes.

### Interpretation

```text
RMSE = 12
```

Means:

```text
The typical prediction error is about 12 units, with large mistakes weighted more heavily.
```

### Why RMSE Matters

RMSE gives more weight to large errors because each error is squared before averaging.

Example:

| Error |
|---------|
| 1 |
| 2 |
| 3 |
| 20 |

The error of 20 has a much bigger impact on RMSE than on MAE. For these four errors, MAE = 6.5 while RMSE ≈ 10.17.

---

## 3. R² Score (Coefficient of Determination)

Measures how much variation in the target variable is explained by the model.

### Interpretation

```text
R² = 0.90
```

Means:

```text
The model explains 90% of the variation in house prices.
```

### R² Scale

| R² Value | Interpretation |
|-----------|---------------|
| 1.0 | Perfect Model |
| 0.9+ | Excellent |
| 0.8+ | Very Good |
| 0.7+ | Good |
| 0.5+ | Moderate |
| 0 | No Better Than Predicting the Mean |
| < 0 | Worse Than Predicting the Mean |

---

# 🏢 Real Industry Evaluation

Suppose we have two models.

## Model A

| Metric | Value |
|----------|----------|
| MAE | 8 |
| RMSE | 12 |
| R² | 0.90 |

---

## Model B

| Metric | Value |
|----------|----------|
| MAE | 6 |
| RMSE | 8 |
| R² | 0.95 |

---

# 📈 Model Comparison

| Metric | Better Model |
|----------|----------|
| MAE | Model B |
| RMSE | Model B |
| R² | Model B |

---

# ✅ Decision

**Model B is better**

### Reasons

- Lower MAE
- Lower RMSE
- Higher R²

### Interpretation

```text
Model B has a lower average error, makes fewer large mistakes, and explains more variation in the data.
```

Therefore:

```