| import streamlit as st | |
| import pandas as pd | |
| import numpy as np | |
| import matplotlib.pyplot as plt | |
| import seaborn as sns | |
| import io | |
| import sys | |
| from contextlib import redirect_stdout | |
| # Initialize session state for notebook-like cells | |
| if 'cells' not in st.session_state: | |
| st.session_state.cells = [] | |
| if 'df' not in st.session_state: | |
| st.session_state.df = None | |
| def capture_output(code, df=None): | |
| """Helper function to capture print output""" | |
| f = io.StringIO() | |
| with redirect_stdout(f): | |
| try: | |
| # Namespace exposed to the executed snippet; 'st' is included because the example snippets call st.pyplot() | |
| variables = {'pd': pd, 'np': np, 'plt': plt, 'sns': sns, 'st': st} | |
| if df is not None: | |
| variables['df'] = df | |
| exec(code, variables) | |
| except Exception as e: | |
| return f"Error: {str(e)}" | |
| return f.getvalue() | |
| def show(): | |
| st.title("Week 3: Data Cleaning and Exploratory Data Analysis") | |
| # Section 1: Introduction to EDA | |
| st.header("1. Introduction to Exploratory Data Analysis") | |
| st.markdown(""" | |
| Exploratory Data Analysis (EDA) is a crucial step in any data science project. Whether EDA is the main purpose of your project or is being used for feature selection/feature engineering in a machine learning context, it's important to understand the relationships between your features and target variables. | |
| In this module, we'll focus on: | |
| - Understanding categorical variables | |
| - Data cleaning techniques | |
| - Visualizing relationships in data | |
| - Identifying patterns and insights | |
| """) | |
| # Section 2: The Titanic Dataset | |
| st.header("2. Working with the Titanic Dataset") | |
| st.markdown(""" | |
| We'll use the famous Titanic dataset to demonstrate data cleaning and EDA techniques. This dataset contains information about passengers aboard the Titanic and whether they survived. | |
| ### Dataset Description | |
| | Variable | Definition | Key | | |
| | -------- | ---------- | --- | | |
| Survived | Survival | 0 = No, 1 = Yes | |
| Pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd | |
| Sex | Sex | | |
| Age | Age in years | | |
| SibSp | Number of siblings / spouses aboard | | |
| Parch | Number of parents / children aboard | | |
| Ticket | Ticket number | | |
| Fare | Passenger fare | | |
| Cabin | Cabin number | | |
| Embarked | Port of embarkation | C = Cherbourg, Q = Queenstown, S = Southampton | |
| """) | |
| # Load and display the dataset | |
| def load_data(): | |
| return pd.read_csv("https://raw.githubusercontent.com/hoffm386/eda-with-categorical-variables/master/titanic.csv") | |
| df = load_data() | |
| st.session_state.df = df | |
| st.subheader("Dataset Preview") | |
| st.dataframe(df.head()) | |
| # Interactive Data Loading Example | |
| st.subheader("Try loading the data yourself!") | |
| load_code = st.text_area("Try loading the Titanic dataset:", | |
| 'import pandas as pd\n\ndf = pd.read_csv("https://raw.githubusercontent.com/hoffm386/eda-with-categorical-variables/master/titanic.csv")\nprint(df.head())', | |
| height=100) | |
| st.code(load_code, language="python", line_numbers=True) | |
| if st.button("Run Data Loading Code"): | |
| output = capture_output(load_code, df) | |
| st.code(output, language="python", line_numbers=True) | |
| # Basic Dataset Information | |
| st.subheader("Dataset Information") | |
| st.markdown(""" | |
| Let's explore some basic information about our dataset. Try these commands: | |
| """) | |
| info_code = st.text_area("Try getting dataset information:", | |
| 'print("Dataset Shape:", df.shape)\nprint("\\nColumn Names:", df.columns.tolist())\nprint("\\nData Types:\\n", df.dtypes)\nprint("\\nMissing Values:\\n", df.isnull().sum())', | |
| height=150) | |
| st.code(info_code, language="python", line_numbers=True) | |
| if st.button("Run Info Code"): | |
| output = capture_output(info_code, df) | |
| st.code(output, language="python", line_numbers=True) | |
| # Section 3: Data Cleaning | |
| st.header("3. Data Cleaning Techniques") | |
| # Missing Value Handling | |
| st.subheader("Missing Value Analysis") | |
| st.markdown(""" | |
| Let's analyze and handle missing values in our dataset. Try these examples: | |
| """) | |
| missing_code = st.text_area("Try analyzing missing values:", | |
| 'missing_percent = (df.isnull().sum() / len(df)) * 100\nprint("Percentage of missing values:\\n", missing_percent[missing_percent > 0])\n\n# Fill missing ages with the median (assign back instead of chained inplace=True)\ndf_filled = df.copy()\ndf_filled["Age"] = df_filled["Age"].fillna(df_filled["Age"].median())\nprint("\\nMissing values after filling Age:", df_filled["Age"].isnull().sum())', | |
| height=150) | |
| st.code(missing_code, language="python", line_numbers=True) | |
| if st.button("Run Missing Value Code"): | |
| output = capture_output(missing_code, df) | |
| st.code(output, language="python", line_numbers=True) | |
| # Data Type Conversion | |
| st.subheader("Data Type Conversion") | |
| st.markdown(""" | |
| Let's convert categorical variables to the appropriate data types: | |
| """) | |
| type_code = st.text_area("Try converting data types:", | |
| 'df_cat = df.copy()\ndf_cat["Sex"] = df_cat["Sex"].astype("category")\ndf_cat["Embarked"] = df_cat["Embarked"].astype("category")\nprint("Data types after conversion:\\n", df_cat.dtypes)', | |
| height=100) | |
| st.code(type_code, language="python", line_numbers=True) | |
| if st.button("Run Type Conversion Code"): | |
| output = capture_output(type_code, df) | |
| st.code(output, language="python", line_numbers=True) | |
| # Section 4: EDA with Categorical Variables | |
| st.header("4. EDA with Categorical Variables") | |
| # Interactive Visualizations | |
| st.subheader("Create Your Own Visualizations") | |
| st.markdown(""" | |
| Let's explore different types of visualizations to understand our data better: | |
| 1. **Basic Count Plots** | |
| First, let's look at the passenger count by sex and the survival rate by passenger class: | |
| """) | |
| viz_code = st.text_area("Try creating basic visualizations:", | |
| '''import matplotlib.pyplot as plt | |
| import seaborn as sns | |
| # Create a figure with two subplots | |
| fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5)) | |
| # Count plot for Sex | |
| sns.countplot(data=df, x="Sex", ax=ax1) | |
| ax1.set_title("Passenger Count by Sex") | |
| # Bar plot for survival rate by Pclass | |
| sns.barplot(data=df, x="Pclass", y="Survived", ax=ax2) | |
| ax2.set_title("Survival Rate by Passenger Class") | |
| plt.tight_layout() | |
| st.pyplot(fig)''', | |
| height=200) | |
| st.code(viz_code, language="python", line_numbers=True) | |
| if st.button("Run Basic Visualization Code"): | |
| output = capture_output(viz_code, df) | |
| st.pyplot(plt.gcf()) | |
| # Advanced Visualizations | |
| st.subheader("Advanced Visualizations") | |
| st.markdown(""" | |
| Now let's create more complex visualizations to understand relationships between variables: | |
| 2. **Survival Analysis by Class** | |
| Let's compare survival across passenger classes, first with a grouped count plot and then with a stacked bar chart of proportions: | |
| """) | |
| advanced_viz_code = st.text_area("Try creating advanced visualizations:", | |
| '''import matplotlib.pyplot as plt | |
| import seaborn as sns | |
| from matplotlib.patches import Patch | |
| # Create figure and axis | |
| fig, ax = plt.subplots(figsize=(10, 6)) | |
| # Create countplot with custom colors | |
| sns.countplot(x="Pclass", hue="Survived", data=df, | |
| palette={1: "blue", 0: "red"}, ax=ax) | |
| # Customize the plot | |
| ax.set_xlabel("Passenger Class") | |
| ax.set_title("Survival Distribution by Passenger Class") | |
| # Create custom legend | |
| legend_elements = [ | |
| Patch(facecolor="blue", label="Survived"), | |
| Patch(facecolor="red", label="Did Not Survive") | |
| ] | |
| ax.legend(handles=legend_elements) | |
| plt.tight_layout() | |
| st.pyplot(fig) | |
| # Create a second figure for percentage analysis | |
| fig2, ax2 = plt.subplots(figsize=(10, 6)) | |
| # Calculate the proportion of each survival outcome within each class (values between 0 and 1) | |
| survival_by_class = df.groupby("Pclass")["Survived"].value_counts(normalize=True).unstack() | |
| survival_by_class.plot(kind="bar", stacked=True, ax=ax2) | |
| # Customize the plot | |
| ax2.set_xlabel("Passenger Class") | |
| ax2.set_ylabel("Proportion of passengers") | |
| ax2.set_title("Survival Rate by Passenger Class") | |
| ax2.legend(title="Survived", labels=["No", "Yes"]) | |
| plt.tight_layout() | |
| st.pyplot(fig2)''', | |
| height=400) | |
| st.code(advanced_viz_code, language="python", line_numbers=True) | |
| if st.button("Run Advanced Visualization Code"): | |
| output = capture_output(advanced_viz_code, df) | |
| st.pyplot(plt.gcf()) | |
| # Age Distribution Analysis | |
| st.subheader("Age Distribution Analysis") | |
| st.markdown(""" | |
| 3. **Age Distribution by Survival** | |
| Let's examine how age relates to survival: | |
| """) | |
| age_viz_code = st.text_area("Try creating age distribution visualizations:", | |
| '''import matplotlib.pyplot as plt | |
| # Create figure and axis | |
| fig, ax = plt.subplots() | |
| # Plot histograms for survived and non-survived passengers (rows with a missing Age are dropped) | |
| ax.hist(df[df["Survived"]==1]["Age"].dropna(), bins=15, alpha=0.5, color="blue", label="survived") | |
| ax.hist(df[df["Survived"]==0]["Age"].dropna(), bins=15, alpha=0.5, color="green", label="did not survive") | |
| # Customize the plot | |
| ax.set_xlabel("Age") | |
| ax.set_ylabel("Count of passengers") | |
| ax.set_title("Age vs. Survival for Titanic Passengers") | |
| ax.legend() | |
| plt.tight_layout() | |
| st.pyplot(fig)''', | |
| height=200) | |
| st.code(age_viz_code, language="python", line_numbers=True) | |
| if st.button("Run Age Distribution Code"): | |
| output = capture_output(age_viz_code, df) | |
| st.pyplot(plt.gcf()) | |
| # Age and Fare Analysis | |
| st.subheader("Age and Fare Analysis") | |
| st.markdown(""" | |
| 4. **Survival by Age and Fare** | |
| Let's analyze how both age and fare relate to survival: | |
| """) | |
| age_fare_viz_code = st.text_area("Try creating age and fare visualizations:", | |
| '''import matplotlib.pyplot as plt | |
| from matplotlib.lines import Line2D | |
| # Create figure and axis | |
| fig, ax = plt.subplots(figsize=(10, 5)) | |
| # Plot scatter points for survived and non-survived passengers | |
| ax.scatter(df[df["Survived"]==1]["Age"], df[df["Survived"]==1]["Fare"], | |
| c="blue", alpha=0.5, label="survived") | |
| ax.scatter(df[df["Survived"]==0]["Age"], df[df["Survived"]==0]["Fare"], | |
| c="green", alpha=0.5, label="did not survive") | |
| # Customize the plot | |
| ax.set_xlabel("Age") | |
| ax.set_ylabel("Fare") | |
| ax.set_title("Survival by Age and Fare for Titanic Passengers") | |
| # Create custom legend | |
| color_patches = [ | |
| Line2D([0], [0], marker='o', color='w', label='survived', | |
| markerfacecolor='b', markersize=10), | |
| Line2D([0], [0], marker='o', color='w', label='did not survive', | |
| markerfacecolor='g', markersize=10) | |
| ] | |
| ax.legend(handles=color_patches) | |
| plt.tight_layout() | |
| st.pyplot(fig)''', | |
| height=250) | |
| st.code(age_fare_viz_code, language="python", line_numbers=True) | |
| if st.button("Run Age and Fare Visualization Code"): | |
| output = capture_output(age_fare_viz_code, df) | |
| st.pyplot(plt.gcf()) | |
| # Section 5: Hands-on Exercise | |
| st.header("5. Hands-on Exercise") | |
| st.markdown(""" | |
| ### Tasks for this week: | |
| 1. **Data Cleaning Exercise** | |
| - Load the dataset used for your research | |
| - Identify and handle missing values | |
| - Convert categorical variables | |
| - Create summary statistics | |
| 2. **EDA Analysis** | |
| - Create visualizations for key variables | |
| - Analyze relationships between variables | |
| - Identify patterns in your outcome variable (as we did with survival rates) | |
| 3. **Report Writing** | |
| - Document your findings | |
| - Create a presentation of key insights | |
| - Suggest potential next steps | |
| """) | |
| # Interactive Exercise | |
| st.subheader("Try Your Own Analysis") | |
| exercise_code = st.text_area("Write your own analysis code here:", | |
| '# Your code here\n# Try analyzing the relationship between Age and Survival\n# Or create your own visualizations\n# Or perform any other analysis you find interesting', | |
| height=150) | |
| st.code(exercise_code, language="python", line_numbers=True) | |
| if st.button("Run Exercise Code"): | |
| output = capture_output(exercise_code, df) | |
| st.code(output, language="python", line_numbers=True) | |
| # Section 6: Homework | |
| st.header("6. Homework This Week") | |
| st.markdown(""" | |
| 1. Please use your research dataset to complete the following tasks: | |
| - Analyze data for any missing values | |
| - Get basic information about the dataset (Hint: use the [Dataset Information](#dataset-information) section above) | |
| - Create visualizations to understand the data | |
| - Hint: use the [Create Your Own Visualizations](#create-your-own-visualizations) section above | |
| - Write a report of your findings and save the graphs produced | |
| - Your report should cover what you find interesting about the data | |
| - Suggest possible research questions | |
| - Please submit your homework on WeChat | |
| """) | |
| # Section 7: Resources | |
| st.header("7. Additional Resources") | |
| st.markdown(""" | |
| - [EDA with Categorical Variables](https://github.com/hoffm386/eda-with-categorical-variables) | |
| - [Kaggle EDA Tutorial](https://www.kaggle.com/code/kashnitsky/topic-1-exploratory-data-analysis-with-pandas) | |
| - [Pandas Documentation](https://pandas.pydata.org/docs/) | |
| - [Seaborn Documentation](https://seaborn.pydata.org/) | |
| """) | |
| if __name__ == "__main__": | |
| show() | |
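
The cleaning steps that the app walks through (measuring missing values, filling `Age` with the median, and converting `Sex` to a categorical dtype) can also be tried outside Streamlit. The following is a minimal sketch under stated assumptions: the `toy` DataFrame and the helpers `summarize_missing` and `clean` are hypothetical names made up for illustration and are not part of the app or the Titanic CSV.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a few of the Titanic columns.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan, 54.0],
    "Sex": ["male", "female", "female", "male", "male"],
    "Survived": [0, 1, 1, 0, 0],
})


def summarize_missing(frame):
    """Return the percentage of missing values for columns that have any."""
    percent = frame.isnull().mean() * 100
    return percent[percent > 0]


def clean(frame):
    """Fill missing ages with the median and convert Sex to a categorical dtype."""
    cleaned = frame.copy()  # leave the caller's DataFrame untouched
    # Assign the result back rather than using a chained inplace=True fill.
    cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].median())
    cleaned["Sex"] = cleaned["Sex"].astype("category")
    return cleaned


missing_before = summarize_missing(toy)
cleaned = clean(toy)

# Mean of a 0/1 column is the survival rate within each group.
survival_rate_by_sex = cleaned.groupby("Sex", observed=True)["Survived"].mean()

print(missing_before)
print(cleaned.dtypes)
print(survival_rate_by_sex)
```

Working on a copy keeps the original data available for comparison, which mirrors the `df.copy()` pattern used in the app's own snippets.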