Here are five tricky examples that showcase some advanced techniques for data analysis using pandas.
#1 Dealing with datetime data
import pandas as pd
# Convert a column to datetime format
data['date_column'] = pd.to_datetime(data['date_column'])
# Extract components from datetime (e.g., year, month, day)
data['year'] = data['date_column'].dt.year
data['month'] = data['date_column'].dt.month
# Calculate the time difference between two datetime columns
data['time_diff'] = data['end_time'] - data['start_time']
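The snippet above assumes a DataFrame named data already exists. As a minimal, self-contained sketch, the version below builds a small hypothetical frame (the sample values are invented for illustration), uses errors="coerce" so that unparseable strings become NaT instead of raising, and converts the resulting time difference to hours.

```python
import pandas as pd

# Hypothetical sample data; the column names mirror the snippet above.
data = pd.DataFrame({
    "date_column": ["2023-01-15", "2023-02-20", "not a date"],
    "start_time": ["2023-01-15 08:00", "2023-02-20 09:30", "2023-03-01 10:00"],
    "end_time": ["2023-01-15 17:30", "2023-02-21 09:30", "2023-03-01 10:45"],
})

# errors="coerce" turns values that cannot be parsed into NaT.
data["date_column"] = pd.to_datetime(data["date_column"], errors="coerce")

# Both columns must be datetimes before they can be subtracted.
data["start_time"] = pd.to_datetime(data["start_time"])
data["end_time"] = pd.to_datetime(data["end_time"])

data["year"] = data["date_column"].dt.year
data["time_diff"] = data["end_time"] - data["start_time"]

# A Timedelta column exposes .dt.total_seconds() for numeric work.
data["hours"] = data["time_diff"].dt.total_seconds() / 3600
```

Subtracting two datetime columns yields a Timedelta column, so converting to a plain number is usually needed before plotting or aggregating.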
#2 Working with text data
# Convert text to lowercase
data['text_column'] = data['text_column'].str.lower()
# Count the occurrences of a specific word (the pattern is treated as a regular expression)
data['word_count'] = data['text_column'].str.count('word')
# Extract information using regular expressions (expand=False returns a Series)
data['extracted_info'] = data['text_column'].str.extract(r'(\d+)', expand=False)
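To see what these string methods return, here is a small sketch on invented sample text. It assumes the word of interest is "order" and adds word boundaries to the pattern so that longer words containing it are not counted. It also shows str.findall, which is not part of the snippet above, for collecting every match rather than only the first.

```python
import pandas as pd

# Hypothetical sample text for illustration.
data = pd.DataFrame({
    "text_column": ["Order 42 shipped", "No order number", "order 7 and order 19"],
})

data["text_column"] = data["text_column"].str.lower()

# str.count takes a regex; \b keeps "order" from matching inside longer words.
data["word_count"] = data["text_column"].str.count(r"\border\b")

# str.extract keeps only the first match; rows without a match become NaN.
data["extracted_info"] = data["text_column"].str.extract(r"(\d+)", expand=False)

# str.findall returns a list of all matches for each row.
data["all_numbers"] = data["text_column"].str.findall(r"\d+")
```

Note that the extracted values are strings; pd.to_numeric can convert them if numeric values are needed.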
#3 Handling large datasets efficiently
# Read a large dataset in chunks
chunk_size = 100000
data_chunks = pd.read_csv('large_data.csv', chunksize=chunk_size)
# Process data in chunks
for chunk in data_chunks:
    # Perform calculations or manipulations on each chunk
    pass  # placeholder: replace with the per-chunk processing
# Append data from multiple files
file_list = ['file1.csv', 'file2.csv', 'file3.csv']
combined_data = pd.concat([pd.read_csv(file) for file in file_list])
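The loop body above is left open. One common pattern is to reduce each chunk to a small summary and combine the summaries, so the full dataset never has to sit in memory at once. The sketch below assumes hypothetical category and amount columns and uses an in-memory string in place of a real large file so that it runs as written.

```python
import io
import pandas as pd

# Stand-in for a large CSV file; the columns are hypothetical.
csv_text = "category,amount\na,10\nb,20\na,30\nb,40\na,50\n"

totals = {}
row_count = 0

# chunksize=2 is tiny on purpose; a real file would use a much larger value.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    row_count += len(chunk)
    # Aggregate within the chunk, then merge into the running totals.
    chunk_sums = chunk.groupby("category")["amount"].sum()
    for key, value in chunk_sums.items():
        totals[key] = totals.get(key, 0) + int(value)
```

This works for aggregations that can be combined across chunks, such as sums and counts. A mean would need the running sum and count kept separately and divided at the end.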
#4 Pivot tables and reshaping data
# Create a pivot table
pivot_table = data.pivot_table(values='column2', index='column1', columns='column3', aggfunc='mean')
# Unstack the pivot table back into long format (one row per column3/column1 pair)
unstacked_data = pivot_table.unstack().reset_index()
# Melt a DataFrame from wide to long format
melted_data = pd.melt(data, id_vars=['id'], value_vars=['var1', 'var2'], var_name='variable', value_name='value')
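A small round trip makes the reshaping concrete. The sketch below reuses the column1, column2, and column3 names from the snippet above with invented values, pivots to a wide table of means, and then melts that table back to long format. The mean_value column name is an assumption chosen for illustration.

```python
import pandas as pd

# Hypothetical sample data using the column names from the snippet above.
data = pd.DataFrame({
    "column1": ["x", "x", "y", "y", "y"],
    "column3": ["p", "q", "p", "p", "q"],
    "column2": [1.0, 2.0, 3.0, 5.0, 6.0],
})

# Wide format: one row per column1 value, one column per column3 value.
pivot_table = data.pivot_table(
    values="column2", index="column1", columns="column3", aggfunc="mean"
)

# Back to long format: one row per (column1, column3) pair.
long_form = pivot_table.reset_index().melt(
    id_vars=["column1"], var_name="column3", value_name="mean_value"
)
```

Here the two rows for ("y", "p") are averaged into a single cell, so the long form has four rows rather than the original five.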
#5 Efficient memory usage
# Optimize memory usage of DataFrame columns
data['numeric_column'] = pd.to_numeric(data['numeric_column'], downcast='integer')
data['category_column'] = data['category_column'].astype('category')
# Load a subset of columns from a large dataset
selected_columns = ['column1', 'column2', 'column3']
data_subset = pd.read_csv('large_data.csv', usecols=selected_columns)
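To check whether these conversions actually help, memory_usage(deep=True) can be compared before and after. The sketch below uses invented data with 1,000 rows. The exact byte counts depend on the pandas version and platform, so none are shown.

```python
import pandas as pd

# Hypothetical sample data: small integers and a few repeated labels.
data = pd.DataFrame({
    "numeric_column": list(range(1000)),
    "category_column": ["red", "green", "blue", "green"] * 250,
})

# deep=True also counts the memory held by the strings themselves.
before = data.memory_usage(deep=True).sum()

# downcast="integer" picks the smallest integer dtype that fits the values.
data["numeric_column"] = pd.to_numeric(data["numeric_column"], downcast="integer")

# category stores each distinct label once plus a small integer code per row.
data["category_column"] = data["category_column"].astype("category")

after = data.memory_usage(deep=True).sum()
print(f"before: {before} bytes, after: {after} bytes")
```

The category conversion pays off when a column has few distinct values relative to its length; for columns of mostly unique strings it can use more memory, not less.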
These examples demonstrate advanced techniques for handling datetime data, working with text, processing large datasets, reshaping data, and optimizing memory usage. They highlight some of the powerful features that pandas provides for complex data analysis tasks.