Part 6: Data Scaling – Normalization vs. Standardization

In our previous part, we witnessed a "numerical explosion." Because our target formula involves x³, the raw values reached into the thousands, causing our loss to skyrocket into the millions. To build a stable neural network, we must bring these values into a range the model can handle.

In deep learning, this is achieved through Data Scaling. There are two primary methods used to tame raw data: Normalization and Standardization.

Normalization (Min-Max Scaling)

Normalization rescales the data into a fixed range, typically [0, 1]. It is like taking a rubber band and stretching or compressing it so that the smallest value is at 0 and the largest is at 1.

Here is the formula:

$$x_{norm}=\frac{x-x_{min}}{x_{max}-x_{min}}$$

Best For: Algorithms that do not assume a specific distribution of data (like K-Nearest Neighbors) or when you need a strictly bounded range.
The Catch: It is extremely sensitive to outliers. If you have one massive value in your dataset, it will "squish" all other meaningful data points into a tiny range near zero, making it hard for the model to distinguish between them.

💡 Outliers are data points that are much larger or smaller than the other observations in a dataset. They lie at the extreme ends of the data's range and can indicate errors or interesting variations.
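
To make the formula and the outlier problem concrete, here is a minimal sketch in plain Rust. The helper name `min_max_scale` is hypothetical and is not part of the series' code; it only applies the Min-Max formula to a slice of `f32` values.

```rust
/// Min-Max scaling: maps the smallest value to 0.0 and the largest to 1.0.
/// `min_max_scale` is a hypothetical helper used only for illustration.
pub fn min_max_scale(values: &[f32]) -> Vec<f32> {
    let min = values.iter().fold(f32::INFINITY, |a, &b| a.min(b));
    let max = values.iter().fold(f32::NEG_INFINITY, |a, &b| a.max(b));
    // Guard against a zero range (all values identical).
    let range = (max - min).max(1e-8);
    values.iter().map(|&v| (v - min) / range).collect()
}
```

Calling `min_max_scale(&[1.0, 2.0, 3.0, 4.0, 5.0])` spreads the values evenly as 0.0, 0.25, 0.5, 0.75, 1.0. Replacing the last value with 1000.0 pushes the first four results below 0.01, which is the "squishing" effect described above.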

Standardization (Z-Score Scaling)

Standardization transforms the data so that it has a mean of 0 and a standard deviation of 1. Instead of squashing everything into a box, it centers the data and describes each point by how many "steps" (standard deviations) it is away from the average.

Here is the formula:

$$x_{std} = \frac{x - \mu}{\sigma}$$

(Where μ is the mean and σ is the standard deviation)

Best For: Most Deep Learning tasks and algorithms that perform better when data follows a Gaussian (Normal) distribution.
The Advantage: It is less sensitive to outliers than Min-Max scaling. An extreme value will simply get a high Z-score (like 10.0) rather than forcing every other point into a tiny fixed range. The mean and standard deviation are still influenced by extreme values, so the protection is only partial.
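
As a worked example of the Z-score formula, the sketch below standardizes a slice of `f32` values. The helper name `standardize` is hypothetical, and it assumes the population standard deviation (dividing by n rather than n - 1).

```rust
/// Z-score scaling: subtract the mean, then divide by the standard deviation.
/// `standardize` is a hypothetical helper used only for illustration.
pub fn standardize(values: &[f32]) -> Vec<f32> {
    let n = values.len() as f32;
    let mean = values.iter().sum::<f32>() / n;
    // Population variance: average squared distance from the mean.
    let variance = values.iter().map(|&v| (v - mean).powi(2)).sum::<f32>() / n;
    // Guard against a zero standard deviation (all values identical).
    let std = variance.sqrt().max(1e-8);
    values.iter().map(|&v| (v - mean) / std).collect()
}
```

For the values 2, 4, 4, 4, 5, 5, 7, 9 the mean is 5 and the standard deviation is 2, so the value 9 becomes (9 - 5) / 2 = 2.0, meaning it sits two standard deviations above the average.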

Implementing Scaling

We are going to implement a scaling system that lets us switch between Normalization, Standardization, and No Scaling by changing an environment variable. This way you can experiment with different scaling techniques without recompiling the code.

Define the Scaling Enum

First, we define an enum to represent our strategies.

#[derive(Debug, Clone, PartialEq)]
enum ScalingMethod {
    None,   // Raw values
    Norm,   // Min-Max Normalization [0, 1]
    Stand,  // Z-Score Standardization (Mean 0, Std 1)
}

Updating the Configuration

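We update TrainConfig to read SCALING_METHOD from the environment (set in our .env file). The values norm and stand select the corresponding method; anything else, including a missing variable, falls back to no scaling.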

#[derive(Debug, Clone)]
pub struct TrainConfig {
    hidden_size: usize,
    batch_size: usize,
    num_epochs: usize,
    scaling_method: ScalingMethod, // New field
    run_name: String,
}

impl TrainConfig {
    fn new() -> Self {
        // ... (other fields)
        let scaling_method = match std::env::var("SCALING_METHOD").unwrap_or_default().as_str() {
            "norm" => ScalingMethod::Norm,
            "stand" => ScalingMethod::Stand,
            _ => ScalingMethod::None,
        };

        TrainConfig {
            // ...
            scaling_method,
            run_name,
        }
    }
}
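
One possible refinement, shown here only as a sketch, is to pull the string matching into its own function so it can be tested without touching the environment. The names `parse_scaling_method` and `scaling_method_from_env` are hypothetical, and the trimming and lowercasing are additions that the code above does not perform.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum ScalingMethod {
    None,  // Raw values
    Norm,  // Min-Max Normalization [0, 1]
    Stand, // Z-Score Standardization (Mean 0, Std 1)
}

/// Hypothetical helper: maps a raw string to a scaling method.
/// Unlike the original match, it ignores surrounding whitespace and case.
pub fn parse_scaling_method(raw: &str) -> ScalingMethod {
    match raw.trim().to_ascii_lowercase().as_str() {
        "norm" => ScalingMethod::Norm,
        "stand" => ScalingMethod::Stand,
        // Unknown or empty values fall back to no scaling.
        _ => ScalingMethod::None,
    }
}

/// Hypothetical helper: reads SCALING_METHOD, treating a missing variable as "".
pub fn scaling_method_from_env() -> ScalingMethod {
    parse_scaling_method(&std::env::var("SCALING_METHOD").unwrap_or_default())
}
```

With this split, `TrainConfig::new` would only need to call `scaling_method_from_env()`, while the parsing rules can be checked with plain string inputs.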

The Multi-Method `prepare_tensors`

prepare_tensors now follows a three-step process: Calculate Statistics, Apply Transformation, and Batch Tensors.

Step A: Calculate Statistics

Before creating tensors, we must determine the "scale" of our current data slice.

For Standardization: We find the Mean (μ) and Standard Deviation (σ).
For Normalization: We find the Minimum and Maximum values.

let data = read_data_from_csv("data.csv").expect("should read data from csv");
let batch_size = self.train_config.batch_size;

// Initialize stats: (param1, param2)
// For Stand: (mean, std) | For Norm: (min, max)
let (mut x_stats, mut y_stats) = ((0.0, 1.0), (0.0, 1.0));

let start = range.start;
let end = range.end;
let slice = &data[start..end];

let xs: Vec<f32> = slice.iter().map(|(x, _)| *x).collect();
let ys: Vec<f32> = slice.iter().map(|(_, y)| *y).collect();

Step B: Apply Transformation

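We first fill in x_stats and y_stats for the selected method, then iterate through our data and apply the appropriate formula. The .max(1e-8) guard prevents a division by zero when all values in the slice are identical.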

match self.train_config.scaling_method {
	ScalingMethod::Stand => {
		let x_mean = xs.iter().sum::<f32>() / xs.len() as f32;
		let y_mean = ys.iter().sum::<f32>() / ys.len() as f32;
		let x_std = (xs.iter().map(|&v| (v - x_mean).powi(2)).sum::<f32>()
			/ xs.len() as f32)
			.sqrt();
		let y_std = (ys.iter().map(|&v| (v - y_mean).powi(2)).sum::<f32>()
			/ ys.len() as f32)
			.sqrt();
		x_stats = (x_mean, x_std);
		y_stats = (y_mean, y_std);
	}
	ScalingMethod::Norm => {
		let x_min = xs.iter().fold(f32::INFINITY, |a, &b| a.min(b));
		let x_max = xs.iter().fold(f32::NEG_INFINITY, |a, &b| a.max(b));
		let y_min = ys.iter().fold(f32::INFINITY, |a, &b| a.min(b));
		let y_max = ys.iter().fold(f32::NEG_INFINITY, |a, &b| a.max(b));
		x_stats = (x_min, x_max);
		y_stats = (y_min, y_max);
	}
	ScalingMethod::None => {}
}

let mut inputs: Vec<Tensor<B, 2>> = Vec::new();
let mut targets: Vec<Tensor<B, 1>> = Vec::new();

for (x, y) in slice.iter() {
	let (x_final, y_final) = match self.train_config.scaling_method {
		ScalingMethod::Stand => (
			(*x - x_stats.0) / x_stats.1.max(1e-8),
			(*y - y_stats.0) / y_stats.1.max(1e-8),
		),
		ScalingMethod::Norm => (
			(*x - x_stats.0) / (x_stats.1 - x_stats.0).max(1e-8),
			(*y - y_stats.0) / (y_stats.1 - y_stats.0).max(1e-8),
		),
		ScalingMethod::None => (*x, *y),
	};

	inputs.push(Tensor::<B, 1>::from_floats([x_final], &self.device).unsqueeze());
	targets.push(Tensor::<B, 1>::from_floats([y_final], &self.device));
}
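
The per-value transformation can be isolated from the tensor code and checked on its own. The sketch below mirrors the match arms above in plain Rust, without Burn; the function name `scale_value` is hypothetical, and the enum is redefined here only to keep the snippet self-contained.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ScalingMethod {
    None,
    Norm,
    Stand,
}

/// Hypothetical helper: scales one value.
/// `stats` is (mean, std) for Stand and (min, max) for Norm.
pub fn scale_value(v: f32, method: ScalingMethod, stats: (f32, f32)) -> f32 {
    match method {
        // The 1e-8 floor avoids dividing by zero for constant data.
        ScalingMethod::Stand => (v - stats.0) / stats.1.max(1e-8),
        ScalingMethod::Norm => (v - stats.0) / (stats.1 - stats.0).max(1e-8),
        ScalingMethod::None => v,
    }
}
```

For example, with a mean of 5.0 and a standard deviation of 2.0, the value 7.0 standardizes to 1.0. With a standard deviation of 0.0 the guard takes over, and a value equal to the mean maps to 0.0 instead of NaN.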

Step C: Batching

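Finally, we use Tensor::cat to concatenate the individual samples along dimension 0 into batches (e.g., inputs of shape [10, 1] for a batch size of 10). The last batch may be smaller if the slice length is not a multiple of batch_size.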

let mut batched_inputs: Vec<Tensor<B, 2>> = Vec::new();
let mut batched_targets: Vec<Tensor<B, 1>> = Vec::new();

for i in (0..inputs.len()).step_by(batch_size) {
	let end = std::cmp::min(i + batch_size, inputs.len());
	let input_tensor = Tensor::cat(
		inputs[i..end]
			.iter()
			.map(|t| t.clone().unsqueeze())
			.collect(),
		0,
	);
	let target_tensor = Tensor::cat(
		targets[i..end]
			.iter()
			.map(|t| t.clone().unsqueeze())
			.collect(),
		0,
	);

	batched_inputs.push(input_tensor);
	batched_targets.push(target_tensor);
}
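
The index arithmetic of the loop above can be tried without any tensors. This sketch uses a hypothetical helper, `batch_ranges`, which returns the index range each batch would cover.

```rust
use std::ops::Range;

/// Hypothetical helper: the index ranges covered by each batch.
/// Note that `step_by` panics if `batch_size` is 0.
pub fn batch_ranges(len: usize, batch_size: usize) -> Vec<Range<usize>> {
    (0..len)
        .step_by(batch_size)
        // Clamp the end so the final batch stops at `len`.
        .map(|i| i..std::cmp::min(i + batch_size, len))
        .collect()
}
```

For 25 samples and a batch size of 10, `batch_ranges(25, 10)` yields `0..10`, `10..20`, and `20..25`, so the final batch holds only 5 samples.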

Here is the full code:

pub fn prepare_tensors(
	&self,
	range: std::ops::Range<usize>,
) -> (Vec<(Tensor<B, 2>, Tensor<B, 1>)>, (f32, f32), (f32, f32)) {
	let data = read_data_from_csv("data.csv").expect("should read data from csv");
	let batch_size = self.train_config.batch_size;

	// Initialize stats: (param1, param2)
	// For Stand: (mean, std) | For Norm: (min, max)
	let (mut x_stats, mut y_stats) = ((0.0, 1.0), (0.0, 1.0));

	let start = range.start;
	let end = range.end;
	let slice = &data[start..end];

	let xs: Vec<f32> = slice.iter().map(|(x, _)| *x).collect();
	let ys: Vec<f32> = slice.iter().map(|(_, y)| *y).collect();

	// ---- 1. Calculate Statistics based on Method ----
	match self.train_config.scaling_method {
		ScalingMethod::Stand => {
			let x_mean = xs.iter().sum::<f32>() / xs.len() as f32;
			let y_mean = ys.iter().sum::<f32>() / ys.len() as f32;
			let x_std = (xs.iter().map(|&v| (v - x_mean).powi(2)).sum::<f32>()
				/ xs.len() as f32)
				.sqrt();
			let y_std = (ys.iter().map(|&v| (v - y_mean).powi(2)).sum::<f32>()
				/ ys.len() as f32)
				.sqrt();
			x_stats = (x_mean, x_std);
			y_stats = (y_mean, y_std);
		}
		ScalingMethod::Norm => {
			let x_min = xs.iter().fold(f32::INFINITY, |a, &b| a.min(b));
			let x_max = xs.iter().fold(f32::NEG_INFINITY, |a, &b| a.max(b));
			let y_min = ys.iter().fold(f32::INFINITY, |a, &b| a.min(b));
			let y_max = ys.iter().fold(f32::NEG_INFINITY, |a, &b| a.max(b));
			x_stats = (x_min, x_max);
			y_stats = (y_min, y_max);
		}
		ScalingMethod::None => {}
	}

	let mut inputs: Vec<Tensor<B, 2>> = Vec::new();
	let mut targets: Vec<Tensor<B, 1>> = Vec::new();

	for (x, y) in slice.iter() {
		let (x_final, y_final) = match self.train_config.scaling_method {
			ScalingMethod::Stand => (
				(*x - x_stats.0) / x_stats.1.max(1e-8),
				(*y - y_stats.0) / y_stats.1.max(1e-8),
			),
			ScalingMethod::Norm => (
				(*x - x_stats.0) / (x_stats.1 - x_stats.0).max(1e-8),
				(*y - y_stats.0) / (y_stats.1 - y_stats.0).max(1e-8),
			),
			ScalingMethod::None => (*x, *y),
		};

		inputs.push(Tensor::<B, 1>::from_floats([x_final], &self.device).unsqueeze());
		targets.push(Tensor::<B, 1>::from_floats([y_final], &self.device));
	}

	let mut batched_inputs: Vec<Tensor<B, 2>> = Vec::new();
	let mut batched_targets: Vec<Tensor<B, 1>> = Vec::new();

	for i in (0..inputs.len()).step_by(batch_size) {
		let end = std::cmp::min(i + batch_size, inputs.len());
		let input_tensor = Tensor::cat(
			inputs[i..end]
				.iter()
				.map(|t| t.clone().unsqueeze())
				.collect(),
			0,
		);
		let target_tensor = Tensor::cat(
			targets[i..end]
				.iter()
				.map(|t| t.clone().unsqueeze())
				.collect(),
			0,
		);

		batched_inputs.push(input_tensor);
		batched_targets.push(target_tensor);
	}

	(
		batched_inputs.into_iter().zip(batched_targets).collect(),
		x_stats,
		y_stats,
	)
}

Notice that the function now also returns the scaling statistics. These values are required during model evaluation so that we can reverse the scaling and convert the model’s predictions back to their original, real-world scale.
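
As a rough sketch of what that reversal could look like, the function below inverts both formulas using the returned statistics. The name `unscale_value` is hypothetical and this is not the evaluation code from the series; the enum is redefined only so the snippet stands alone.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ScalingMethod {
    None,
    Norm,
    Stand,
}

/// Hypothetical helper: maps a scaled value back to the original scale.
/// `stats` is (mean, std) for Stand and (min, max) for Norm.
pub fn unscale_value(scaled: f32, method: ScalingMethod, stats: (f32, f32)) -> f32 {
    match method {
        // Inverse of (x - mean) / std, using the same 1e-8 floor as the forward step.
        ScalingMethod::Stand => scaled * stats.1.max(1e-8) + stats.0,
        // Inverse of (x - min) / (max - min).
        ScalingMethod::Norm => scaled * (stats.1 - stats.0).max(1e-8) + stats.0,
        ScalingMethod::None => scaled,
    }
}
```

A standardized prediction of 1.0 with a mean of 5.0 and a standard deviation of 2.0 maps back to 7.0, and a normalized prediction of 0.5 with a minimum of 5.0 and a maximum of 9.0 also maps back to 7.0.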

Running the Code

Now let’s run the updated function and compare the results with the previous version. Since prepare_tensors now returns additional values (the scaling statistics), we need to make a small change in the main function.

Update the call to prepare_tensors:

-let tensors = model.prepare_tensors(0..1000);
+let (tensors, _, _) = model.prepare_tensors(0..1000);

We ignore the returned statistics for now, as they will be used later during evaluation.

Observing the Output

After running the code, you should see output similar to the following:

---------1th iteration start---------
---------forward pass---------
Layer 1 pre-activation (z1) shape=[10, 64], min=-0.370, max=2.205
Layer 1 activation (a1) shape=[10, 64], min=0.000, max=2.205
Layer 2 pre-activation (z2) shape=[10, 1], min=1.000, max=142.093
Loss shape=[1], min=4836.177, max=4836.177

Notice how the loss value is now dramatically smaller compared to the earlier runs.
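
To build some intuition for why scaling shrinks the loss, consider a simplified experiment that is not the network from this series. It assumes a toy target y = x³ for x from -10 to 10 and a "model" that always predicts 0, so the mean squared error is just the average squared target. All function names here are hypothetical.

```rust
/// Hypothetical toy data: y = x^3 for integer x in [-10, 10].
pub fn cubic_targets() -> Vec<f32> {
    (-10..=10).map(|x| (x as f32).powi(3)).collect()
}

/// Mean squared error of a model that always predicts 0.
pub fn mse_against_zero(targets: &[f32]) -> f32 {
    targets.iter().map(|&y| y * y).sum::<f32>() / targets.len() as f32
}

/// Z-score scaling with the population standard deviation.
pub fn standardize(values: &[f32]) -> Vec<f32> {
    let n = values.len() as f32;
    let mean = values.iter().sum::<f32>() / n;
    let variance = values.iter().map(|&v| (v - mean).powi(2)).sum::<f32>() / n;
    let std = variance.sqrt().max(1e-8);
    values.iter().map(|&v| (v - mean) / std).collect()
}
```

On the raw targets, `mse_against_zero` is far above 100,000, because values near ±1000 are squared. After `standardize`, the same measure is approximately 1, since standardized data has a variance of 1. A real network starts from random weights rather than a constant 0, so the actual numbers differ, but the order-of-magnitude effect is the same idea.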

Conclusion

We solved the exploding-loss problem by implementing two scaling methods.

Without Scaling: Our loss was over 6,000,000.
With Scaling (Stand): Loss values now start in the thousands, as in the run above, and often below 2.0, depending on initialization.

By squashing our x³ values into a manageable range, we have created an environment where the neural network's weights can actually learn.
