lunax.data_processing

This module provides utility functions for loading, splitting, and preprocessing tabular data.

load_data(file_path: str) → pd.DataFrame

Load tabular data from a file into a DataFrame.

Parameters:

file_path (str) – Path to the data file (supports csv, parquet, xlsx, xls)

Returns:

Loaded data as DataFrame

Return type:

pd.DataFrame
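How load_data chooses a reader is not described here. One plausible approach is to dispatch on the file extension, which is sketched below using pandas only. The name load_table, the READERS mapping, and the choice to raise ValueError for an unsupported extension are assumptions made for illustration, not the documented behavior of lunax.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical mapping from file extension to a pandas reader.
# Parquet and Excel readers need optional engines (e.g. pyarrow, openpyxl)
# to be installed, but only when they are actually called.
READERS = {
    ".csv": pd.read_csv,
    ".parquet": pd.read_parquet,
    ".xlsx": pd.read_excel,
    ".xls": pd.read_excel,
}


def load_table(file_path: str) -> pd.DataFrame:
    """Illustrative stand-in for load_data: pick a reader by extension."""
    suffix = Path(file_path).suffix.lower()
    if suffix not in READERS:
        # Assumed error type; the real function may behave differently.
        raise ValueError(f"Unsupported file format: {suffix!r}")
    return READERS[suffix](file_path)


# Round-trip a small CSV through a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "example.csv"
    pd.DataFrame({"a": [1, 2], "b": ["x", "y"]}).to_csv(csv_path, index=False)
    loaded = load_table(str(csv_path))
```

The extension check runs before any file access, so an unsupported format fails early without touching the disk.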

Raises:

split_data(df: pd.DataFrame, target: str, test_size: float = 0.2, random_state: int = 42) → Tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series]

Split dataset into training and validation sets.

Parameters:
  • df (pd.DataFrame) – Input DataFrame

  • target (str) – Name of the target column

  • test_size (float) – Proportion of the dataset to include in the validation split

  • random_state (int) – Random seed for reproducibility

Returns:

X_train, X_val, y_train, y_val

Return type:

Tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series]
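To make the role of test_size and random_state concrete, here is a minimal sketch of a random holdout split that returns the same four-part tuple. It uses DataFrame.sample rather than whatever split_data uses internally, and holdout_split is a hypothetical name, so treat it as an approximation of the behavior described above.

```python
from typing import Tuple

import pandas as pd


def holdout_split(
    df: pd.DataFrame,
    target: str,
    test_size: float = 0.2,
    random_state: int = 42,
) -> Tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series]:
    """Illustrative stand-in for split_data (assumes a unique index)."""
    X = df.drop(columns=[target])  # features
    y = df[target]                 # target column as a Series
    # Draw the validation rows at random; the seed makes the draw repeatable.
    X_val = X.sample(frac=test_size, random_state=random_state)
    # Everything not drawn for validation becomes training data.
    X_train = X.drop(index=X_val.index)
    return X_train, X_val, y.loc[X_train.index], y.loc[X_val.index]


df = pd.DataFrame({"f1": range(10), "f2": range(10, 20), "label": [0, 1] * 5})
X_train, X_val, y_train, y_val = holdout_split(df, target="label")
```

With 10 rows and test_size=0.2, two rows go to validation and eight to training; calling the function again with the same random_state selects the same rows.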

preprocess_data(df: pd.DataFrame, target: str = None, numeric_strategy: str = 'mean', category_strategy: str = 'most_frequent', scale_numeric: bool = True, encode_categorical: bool = True) → pd.DataFrame

Perform data preprocessing including missing value handling, encoding, and standardization.

Parameters:
  • df (pd.DataFrame) – Input DataFrame

  • target (str, optional) – Target column name (if any)

  • numeric_strategy (str) – Strategy for filling numeric missing values (‘mean’ or ‘median’)

  • category_strategy (str) – Strategy for filling categorical missing values (‘most_frequent’)

  • scale_numeric (bool) – Whether to standardize numeric features

  • encode_categorical (bool) – Whether to encode categorical features

Returns:

Preprocessed DataFrame

Return type:

pd.DataFrame

Features:

  • Handles both numeric and categorical features

  • Supports missing value imputation

  • Performs feature scaling (standardization)

  • Provides label encoding for categorical variables

  • Preserves original data by working on a copy
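The features listed above can be approximated in plain pandas. The sketch below is not the lunax implementation: preprocess_sketch is a hypothetical name, label encoding is emulated with pandas category codes, standardization uses the population standard deviation, and the target column is assumed to be left untouched.

```python
from typing import Optional

import pandas as pd


def preprocess_sketch(
    df: pd.DataFrame,
    target: Optional[str] = None,
    numeric_strategy: str = "mean",
    scale_numeric: bool = True,
    encode_categorical: bool = True,
) -> pd.DataFrame:
    """Illustrative stand-in for preprocess_data."""
    out = df.copy()  # work on a copy so the caller's DataFrame is preserved
    feature_cols = [c for c in out.columns if c != target]
    numeric_cols = out[feature_cols].select_dtypes(include="number").columns
    categorical_cols = [c for c in feature_cols if c not in numeric_cols]

    for col in numeric_cols:
        # Impute missing values with the column mean or median.
        fill = out[col].median() if numeric_strategy == "median" else out[col].mean()
        out[col] = out[col].fillna(fill)
        if scale_numeric:
            # Standardize to zero mean and unit variance; constant columns become 0.
            std = out[col].std(ddof=0)
            out[col] = (out[col] - out[col].mean()) / std if std > 0 else 0.0

    for col in categorical_cols:
        # Impute with the most frequent value ('most_frequent' strategy).
        mode = out[col].mode(dropna=True)
        if not mode.empty:
            out[col] = out[col].fillna(mode.iloc[0])
        if encode_categorical:
            # Integer codes follow the sorted order of the categories.
            out[col] = out[col].astype("category").cat.codes

    return out


raw = pd.DataFrame(
    {
        "age": [20.0, None, 40.0, 30.0],
        "city": ["b", None, "a", "b"],
        "y": [0, 1, 0, 1],
    }
)
result = preprocess_sketch(raw, target="y")
```

In this example the missing age is filled with the mean (30.0) before scaling, the missing city is filled with "b", and the raw DataFrame still contains its original missing values afterwards.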

Example Usage

import pandas as pd
from lunax.data_processing.utils import preprocess_data

# Load your data
df = pd.DataFrame(...)

# Preprocess with default settings
processed_df = preprocess_data(df, target='target_column')

# Customize preprocessing
processed_df = preprocess_data(
    df,
    target='target_column',
    numeric_strategy='median',
    scale_numeric=False,
    encode_categorical=True
)