lunax.data_processing
This module provides utility functions for data loading, splitting, and preprocessing operations.
- load_data(file_path: str) → pd.DataFrame
Load tabular data from a file into a DataFrame.
- Parameters:
file_path (str) – Path to the data file (supports csv, parquet, xlsx, xls)
- Returns:
Loaded data as DataFrame
- Return type:
pd.DataFrame
- Raises:
ValueError – If file format is not supported
Exception – If data loading fails
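The behaviour described above (pick a reader by file extension, reject anything else with ValueError) can be sketched as follows. This is an illustrative approximation, not the lunax source: the names load_data_sketch and _READERS are hypothetical, and the real function may wrap reader errors differently.

```python
from pathlib import Path

import pandas as pd

# Hypothetical mapping from file suffix to a pandas reader.
# The actual lunax implementation may be organised differently.
_READERS = {
    ".csv": pd.read_csv,
    ".parquet": pd.read_parquet,
    ".xlsx": pd.read_excel,
    ".xls": pd.read_excel,
}


def load_data_sketch(file_path: str) -> pd.DataFrame:
    """Load a table, choosing the pandas reader from the file extension."""
    suffix = Path(file_path).suffix.lower()
    reader = _READERS.get(suffix)
    if reader is None:
        # Mirrors the documented ValueError for unsupported formats.
        raise ValueError(f"Unsupported file format: {suffix!r}")
    return reader(file_path)
```

Note that reading parquet or Excel files with pandas requires an optional engine (for example pyarrow or openpyxl) to be installed.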
- split_data(df: pd.DataFrame, target: str, test_size: float = 0.2, random_state: int = 42) → Tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series]
Split dataset into training and validation sets.
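- Parameters:
df (pd.DataFrame) – Input DataFrame
target (str) – Target column name
test_size (float) – Proportion of rows held out for validation (default 0.2)
random_state (int) – Random seed for a reproducible split (default 42)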
- Returns:
X_train, X_val, y_train, y_val
- Return type:
Tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series]
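A split with this signature and return order is commonly built on scikit-learn's train_test_split. The sketch below assumes that approach; whether lunax uses it internally is not stated here, and split_data_sketch is a hypothetical name used only for illustration.

```python
from typing import Tuple

import pandas as pd
from sklearn.model_selection import train_test_split


def split_data_sketch(
    df: pd.DataFrame,
    target: str,
    test_size: float = 0.2,
    random_state: int = 42,
) -> Tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series]:
    """Separate features from the target, then split into train/validation."""
    X = df.drop(columns=[target])  # every column except the target
    y = df[target]                 # the target column as a Series
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    return X_train, X_val, y_train, y_val


# Toy data: 10 rows, two features, one binary label.
df = pd.DataFrame({"x1": range(10), "x2": range(10, 20), "label": [0, 1] * 5})
X_train, X_val, y_train, y_val = split_data_sketch(df, target="label")
```

With the default test_size of 0.2, the 10-row toy frame yields 8 training rows and 2 validation rows, and a fixed random_state makes the same rows land in each set on every run.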
- preprocess_data(df: pd.DataFrame, target: str = None, numeric_strategy: str = 'mean', category_strategy: str = 'most_frequent', scale_numeric: bool = True, encode_categorical: bool = True) → pd.DataFrame
Perform data preprocessing including missing value handling, encoding, and standardization.
- Parameters:
df (pd.DataFrame) – Input DataFrame
target (str, optional) – Target column name (if any)
numeric_strategy (str) – Strategy for filling numeric missing values (‘mean’ or ‘median’)
category_strategy (str) – Strategy for filling categorical missing values (‘most_frequent’)
scale_numeric (bool) – Whether to standardize numeric features
encode_categorical (bool) – Whether to encode categorical features
- Returns:
Preprocessed DataFrame
- Return type:
pd.DataFrame
Features:
Handles both numeric and categorical features
Supports missing value imputation
Performs feature scaling (standardization)
Provides label encoding for categorical variables
Preserves original data by working on a copy
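To make the listed features concrete, here is a pandas-only sketch of the same sequence of steps: copy, impute, standardize, label-encode. It is an assumption-laden approximation rather than the lunax implementation. The name preprocess_sketch is hypothetical, the target column is assumed to be left untouched, and standardization uses the population standard deviation (ddof=0), which is the convention scikit-learn's StandardScaler follows.

```python
import pandas as pd


def preprocess_sketch(
    df: pd.DataFrame,
    target: str = None,
    numeric_strategy: str = "mean",
    scale_numeric: bool = True,
    encode_categorical: bool = True,
) -> pd.DataFrame:
    """Impute, scale, and encode feature columns on a copy of df."""
    out = df.copy()  # the caller's DataFrame is never modified

    # Assumption: the target column is excluded from all transformations.
    features = [c for c in out.columns if c != target]
    numeric_cols = [c for c in features if pd.api.types.is_numeric_dtype(out[c])]
    category_cols = [c for c in features if c not in numeric_cols]

    for col in numeric_cols:
        # Fill missing values with the column mean or median.
        fill = out[col].median() if numeric_strategy == "median" else out[col].mean()
        out[col] = out[col].fillna(fill)
        if scale_numeric:
            std = out[col].std(ddof=0)
            if std > 0:  # skip constant columns to avoid division by zero
                out[col] = (out[col] - out[col].mean()) / std

    for col in category_cols:
        # 'most_frequent' strategy: fill missing values with the mode.
        mode = out[col].mode(dropna=True)
        if not mode.empty:
            out[col] = out[col].fillna(mode.iloc[0])
        if encode_categorical:
            # Label encoding: map each distinct value to an integer code.
            out[col] = out[col].astype("category").cat.codes

    return out


raw = pd.DataFrame(
    {"age": [20.0, None, 40.0], "city": ["a", None, "a"], "y": [0, 1, 0]}
)
clean = preprocess_sketch(raw, target="y")
```

After the call, clean has no missing values in age or city, while raw still contains its original gaps because the work was done on a copy.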
Example Usage
import pandas as pd
from lunax.data_processing.utils import preprocess_data

# Load your data
df = pd.DataFrame(...)

# Preprocess with default settings
processed_df = preprocess_data(df, target='target_column')

# Customize preprocessing
processed_df = preprocess_data(
    df,
    target='target_column',
    numeric_strategy='median',
    scale_numeric=False,
    encode_categorical=True
)