lunax.data_processing

This module provides utility functions for loading, splitting, and preprocessing tabular data.

load_data(file_path: str) → pd.DataFrame

Load tabular data from a file into a DataFrame.

Parameters:

file_path (str) – Path to the data file (supports csv, parquet, xlsx, xls)

Returns:

Loaded data as DataFrame

Return type:

pd.DataFrame

Raises:

ValueError – If file format is not supported
Exception – If data loading fails
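The extension-based dispatch described above can be sketched as follows. This is a minimal illustration, not lunax's actual implementation: the helper name `load_table_sketch` and the `_READERS` mapping are hypothetical, and the choice of pandas readers for each extension is an assumption. Reading parquet or Excel files additionally requires an optional engine (for example pyarrow or openpyxl) to be installed.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical mapping from file extension to a pandas reader (assumption).
_READERS = {
    ".csv": pd.read_csv,
    ".parquet": pd.read_parquet,
    ".xlsx": pd.read_excel,
    ".xls": pd.read_excel,
}


def load_table_sketch(file_path: str) -> pd.DataFrame:
    """Pick a reader from the file extension; reject unknown formats."""
    suffix = Path(file_path).suffix.lower()
    reader = _READERS.get(suffix)
    if reader is None:
        raise ValueError(f"Unsupported file format: {suffix!r}")
    return reader(file_path)


# Round-trip a small CSV through a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "sample.csv"
    pd.DataFrame({"a": [1, 2], "b": [3, 4]}).to_csv(csv_path, index=False)
    loaded = load_table_sketch(str(csv_path))

# An unsupported extension is rejected before any file access happens.
try:
    load_table_sketch("notes.txt")
except ValueError as exc:
    error_message = str(exc)
```

Checking the extension first means an unsupported format fails fast with ValueError, while genuine read failures surface from the underlying reader.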

split_data(df: pd.DataFrame, target: str, test_size: float = 0.2, random_state: int = 42) → Tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series]

Split dataset into training and validation sets.

Parameters:

df (pd.DataFrame) – Input DataFrame
target (str) – Name of the target column
test_size (float) – Proportion of the dataset to include in the validation split
random_state (int) – Random seed for reproducibility

Returns:

X_train, X_val, y_train, y_val

Return type:

Tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series]
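To illustrate the shape of the four returned objects, here is a rough pandas-only sketch of a random hold-out split. The helper `split_sketch` is hypothetical and assumes the DataFrame has a unique index. The parameter names mirror scikit-learn's `train_test_split`, but how lunax performs the split internally is not stated here, so the exact rows selected for a given seed will likely differ.

```python
import pandas as pd


def split_sketch(df, target, test_size=0.2, random_state=42):
    """Separate features from the target, then hold out a random fraction."""
    X = df.drop(columns=[target])
    y = df[target]
    # Draw the validation rows at random; the remaining rows form the training set.
    X_val = X.sample(frac=test_size, random_state=random_state)
    X_train = X.drop(index=X_val.index)
    return X_train, X_val, y.loc[X_train.index], y.loc[X_val.index]


data = pd.DataFrame(
    {"x1": range(10), "x2": range(10, 20), "label": [0, 1] * 5}
)
X_train, X_val, y_train, y_val = split_sketch(data, target="label")
```

With ten rows and `test_size=0.2`, two rows go to the validation set and eight to the training set, and the target column is absent from both feature frames.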

preprocess_data(df: pd.DataFrame, target: str = None, numeric_strategy: str = 'mean', category_strategy: str = 'most_frequent', scale_numeric: bool = True, encode_categorical: bool = True) → pd.DataFrame

Perform data preprocessing including missing value handling, encoding, and standardization.

Parameters:

df (pd.DataFrame) – Input DataFrame
target (str, optional) – Target column name (if any)
numeric_strategy (str) – Strategy for filling numeric missing values (‘mean’ or ‘median’)
category_strategy (str) – Strategy for filling categorical missing values (‘most_frequent’)
scale_numeric (bool) – Whether to standardize numeric features
encode_categorical (bool) – Whether to encode categorical features

Returns:

Preprocessed DataFrame

Return type:

pd.DataFrame

Features:

Handles both numeric and categorical features
Supports missing value imputation
Performs feature scaling (standardization)
Provides label encoding for categorical variables
Preserves original data by working on a copy
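The listed steps can be approximated in plain pandas as shown below. This is a sketch under stated assumptions rather than lunax's code: the helper `preprocess_sketch` is hypothetical, the target column is assumed to be left untouched, standardization is assumed to use the population standard deviation (`ddof=0`), and label encoding is assumed to assign integer codes in sorted category order.

```python
import pandas as pd


def preprocess_sketch(df, target=None, numeric_strategy="mean",
                      scale_numeric=True, encode_categorical=True):
    """Impute, optionally scale, and optionally label-encode feature columns."""
    if numeric_strategy not in ("mean", "median"):
        raise ValueError(f"Unknown numeric_strategy: {numeric_strategy!r}")

    out = df.copy()  # work on a copy so the caller's frame is unchanged
    features = [c for c in out.columns if c != target]  # assumption: skip target
    numeric_cols = [c for c in features if pd.api.types.is_numeric_dtype(out[c])]
    categorical_cols = [c for c in features if c not in numeric_cols]

    for col in numeric_cols:
        fill = out[col].mean() if numeric_strategy == "mean" else out[col].median()
        out[col] = out[col].fillna(fill)
        if scale_numeric:
            std = out[col].std(ddof=0)
            if std > 0:  # leave constant columns as they are
                out[col] = (out[col] - out[col].mean()) / std

    for col in categorical_cols:
        mode = out[col].mode(dropna=True)
        if not mode.empty:
            out[col] = out[col].fillna(mode.iloc[0])  # 'most_frequent' imputation
        if encode_categorical:
            out[col] = out[col].astype("category").cat.codes

    return out


raw = pd.DataFrame({
    "age": [20.0, None, 40.0, 30.0],
    "city": ["b", "a", None, "a"],
    "y": [0, 1, 0, 1],
})
processed = preprocess_sketch(raw, target="y")
```

In this sample, the missing `age` is filled with the column mean before scaling, the missing `city` is filled with the most frequent value, and `raw` itself keeps its missing values because all work happens on a copy.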

Example Usage

import pandas as pd

from lunax.data_processing.utils import preprocess_data

# Load your data
df = pd.DataFrame(...)

# Preprocess with default settings
processed_df = preprocess_data(df, target='target_column')

# Customize preprocessing
processed_df = preprocess_data(
    df,
    target='target_column',
    numeric_strategy='median',
    scale_numeric=False,
    encode_categorical=True
)