Part 7: Backward Pass
Up to this point, we have built a 'Forward Pass' that makes predictions and a 'Loss Function' that measures exactly how far off those predictions are.
In order to enable our model to learn patterns from this error, we need components that can calculate the mistakes and apply the corrections. This is where the Backward Pass and the Optimizer come in.
In this part, we will build these two core components from scratch. Let's look at what is actually happening behind the scenes.
1. The Backward Pass
The Backward Pass (Backpropagation) is the process of moving backward from the Loss to the inputs to calculate Gradients.
A gradient is essentially a value that tells us: "If I increase this weight by a tiny amount, how much will the total Loss change?"
The Chain Rule: Since our loss is computed through a sequence of nested functions—from inputs through layers to the final loss (Loss = f(g(x))), we use the calculus chain rule to break down and pass the "error signal" backward, multiplying local derivatives along the way.
Layer-by-Layer: We don't calculate everything at once. We calculate how much the output layer is to blame for the loss, immediately derive the gradients for that layer's weights, and then pass the remaining responsibility back to the previous hidden layers.
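To make the chain rule concrete, here is a minimal plain-Rust sketch (no Burn involved) for a one-weight model with Loss = (w * x - y)². The function names are hypothetical and exist only for this illustration. It multiplies the two local derivatives and compares the result against a "nudge the weight by a tiny amount" estimate.

```rust
/// Loss for one sample of a one-weight model: L(w) = (w * x - y)^2.
fn loss(w: f32, x: f32, y: f32) -> f32 {
    let pred = w * x; // inner function g(w)
    (pred - y).powi(2) // outer function f(pred)
}

/// Chain rule: dL/dw = dL/dpred * dpred/dw = 2 * (pred - y) * x.
fn analytic_grad(w: f32, x: f32, y: f32) -> f32 {
    let pred = w * x;
    let dloss_dpred = 2.0 * (pred - y); // local derivative of the outer function
    let dpred_dw = x; // local derivative of the inner function
    dloss_dpred * dpred_dw
}

/// Central finite difference: nudge w by a tiny amount and watch the loss.
fn numeric_grad(w: f32, x: f32, y: f32) -> f32 {
    let eps = 1e-3_f32;
    (loss(w + eps, x, y) - loss(w - eps, x, y)) / (2.0 * eps)
}
```

For w = 0.5, x = 2 and y = 3, the prediction is 1, so the chain rule gives 2 * (1 - 3) * 2 = -8, and the finite-difference estimate should agree up to floating-point error.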
2. The Optimizer
If the Backward Pass tells us which direction to move and how much a weight contributes to the error, the Optimizer is the one that actually turns the knobs.
One of the simplest and most widely used optimizers is Stochastic Gradient Descent (SGD). It updates each weight using a simple rule:
$$\text{New Weight} = \text{Old Weight} - (\text{Learning Rate} \times \text{Gradient})$$
Learning Rate: This is a small multiplier. It controls the "step size." If it's too big, we overshoot the solution; if it's too small, training takes forever.
The "Descent": We subtract the gradient because we want to go against the error. We are hiking down a mountain (the Loss) to find the lowest valley (the perfect weights).
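The effect of the learning rate can be seen on a toy problem. The sketch below is plain Rust rather than Burn, and the loss L(w) = (w - 3)² and the function name `descend` are assumptions chosen only for illustration. It applies the update rule above repeatedly, starting from w = 0.

```rust
/// Runs `steps` gradient-descent updates on the toy loss L(w) = (w - 3)^2,
/// starting from w = 0, and returns the final weight.
fn descend(lr: f32, steps: usize) -> f32 {
    let mut w = 0.0_f32;
    for _ in 0..steps {
        let grad = 2.0 * (w - 3.0); // dL/dw
        w -= lr * grad; // new weight = old weight - learning rate * gradient
    }
    w
}
```

With `lr = 0.1` the weight settles very close to the minimum at 3. With `lr = 1.1` every step overshoots and the weight moves further from 3 each time. With `lr = 0.0001` the weight has barely left 0 after 100 steps.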
3. The Big Picture
When we combine everything, we get the complete "Learning Loop" that happens in every single iteration:
Forward Pass: Data goes in → Prediction comes out.
Loss Calculation: Measures exactly how far off our prediction is from the ground truth.
Backward Pass: Computes the gradients (slopes) for every single weight and bias by tracing the error backward.
Optimizer Step: Adjusts the weights slightly in the direction that reduces the loss, taking our model one step closer to the optimal solution.
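The four steps can be sketched in a few lines of plain Rust for the smallest possible model, pred = w * x. This is not the Burn code from this series; `train_step` is a hypothetical name, and the single weight stands in for all the weights and biases of a real network.

```rust
/// One pass of the four-step loop for the one-weight model pred = w * x.
/// Returns the loss measured before the update.
fn train_step(w: &mut f32, xs: &[f32], ys: &[f32], lr: f32) -> f32 {
    let n = xs.len() as f32;
    // 1. Forward pass: data goes in, predictions come out.
    let preds: Vec<f32> = xs.iter().map(|x| *w * x).collect();
    // 2. Loss calculation: mean squared error against the ground truth.
    let loss = preds.iter().zip(ys).map(|(p, y)| (p - y).powi(2)).sum::<f32>() / n;
    // 3. Backward pass: dL/dw = (2 / N) * sum((pred - y) * x).
    let grad = preds
        .iter()
        .zip(ys)
        .zip(xs)
        .map(|((p, y), x)| 2.0 * (p - y) * x)
        .sum::<f32>()
        / n;
    // 4. Optimizer step: move against the gradient.
    *w -= lr * grad;
    loss
}
```

Calling `train_step` repeatedly on data generated by y = 2x (for example `xs = [1.0, 2.0, 3.0]`, `ys = [2.0, 4.0, 6.0]`, `lr = 0.05`) should drive the returned loss down and move `w` toward 2.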
Implementing the Optimizer: SGD
Now that we understand the concept of "stepping" toward a better solution, let's look at the code. In Burn, parameters are wrapped in a Param struct. Our goal is to take those parameters, apply the gradient update, and re-initialize them with their new, improved values.
The Optimizer Trait
First, we define a generic Optimizer trait. This keeps our code reusable: if you later want to swap SGD for a more advanced optimizer like Adam, you only need to implement this trait.
use burn::module::Param;
use burn::prelude::{Backend, Tensor};
pub trait Optimizer<B: Backend> {
    fn step<const D: usize>(
        &mut self,
        lr: f32,
        param: &mut Param<Tensor<B, D>>,
        grad: Tensor<B, D>,
    );
}
Stochastic Gradient Descent (SGD)
Our first implementation is the classic SGD. As we discussed, its job is to move the weights in the opposite direction of the gradient.
pub struct SGD {}

impl<B: Backend> Optimizer<B> for SGD {
    fn step<const D: usize>(
        &mut self,
        lr: f32,
        param: &mut Param<Tensor<B, D>>,
        grad: Tensor<B, D>,
    ) {
        // 1. Calculate the update: (learning_rate * gradient)
        let update = grad.mul_scalar(lr);
        // 2. Subtract the update from the current value: w_new = w - update
        // We use unsqueeze() here to ensure the shapes match during the subtraction
        let updated = param.val().unsqueeze().sub(update);
        // 3. Update the parameter with the new value while preserving its ID
        *param = Param::initialized(param.id, updated);
    }
}
Why we update this way
grad.mul_scalar(lr): The gradient tells us the "direction" of the error. Multiplying by the learning rate ensures we only take a small, controlled step in that direction.
param.sub(update): Since the gradient points toward the increase in loss, we subtract it to move toward the decrease (the minimum).
Param::initialized: In Burn, parameters are immutable by default for safety. To "change" them, we create a new Param with the updated tensor data but keep the original ID so the model knows it's still the same weight.
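To illustrate why putting the update rule behind a trait pays off, here is a Burn-free sketch in which training code depends only on a trait and two optimizers implement it. Everything in it is hypothetical: `ToyOptimizer`, `ToySgd`, `ToyMomentum` and `apply` are stand-ins that operate on plain slices, and momentum is used instead of Adam only because it is shorter to write.

```rust
/// Hypothetical, Burn-free stand-in for the `Optimizer` trait, operating on slices.
trait ToyOptimizer {
    fn step(&mut self, lr: f32, param: &mut [f32], grad: &[f32]);
}

/// Plain SGD: w = w - lr * grad.
struct ToySgd;

impl ToyOptimizer for ToySgd {
    fn step(&mut self, lr: f32, param: &mut [f32], grad: &[f32]) {
        for (w, g) in param.iter_mut().zip(grad) {
            *w -= lr * g;
        }
    }
}

/// SGD with momentum: keeps a running velocity per weight.
struct ToyMomentum {
    beta: f32,
    velocity: Vec<f32>,
}

impl ToyOptimizer for ToyMomentum {
    fn step(&mut self, lr: f32, param: &mut [f32], grad: &[f32]) {
        // Size the velocity buffer on first use.
        if self.velocity.len() != param.len() {
            self.velocity = vec![0.0; param.len()];
        }
        for ((w, v), g) in param.iter_mut().zip(self.velocity.iter_mut()).zip(grad) {
            *v = self.beta * *v + g; // v = beta * v + grad
            *w -= lr * *v; // w = w - lr * v
        }
    }
}

/// Training code depends only on the trait, so either optimizer fits.
fn apply(opt: &mut dyn ToyOptimizer, lr: f32, param: &mut [f32], grad: &[f32]) {
    opt.step(lr, param, grad);
}
```

Note that this toy momentum optimizer holds a single velocity buffer, so it can only serve one parameter. A stateful optimizer behind the real trait would need separate state for each parameter, for instance keyed by the parameter's ID.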
Implementing the Training Loop
With our SGD implementation ready, we need to wire it into our model's architecture. This involves two steps: updating our configuration to handle the "speed" of learning and adding the optimizer as a persistent part of our model struct.
Updating the Configuration
To control our optimizer, we need a Learning Rate. It determines how large a step the optimizer takes when updating weights.
We’ll update TrainConfig to pull this value from our environment variables:
#[derive(Debug, Clone)]
pub struct TrainConfig {
    hidden_size: usize,
    batch_size: usize,
    num_epochs: usize,
    scaling_method: ScalingMethod,
    // New
    learning_rate: f32,
    run_name: String,
}

impl TrainConfig {
    fn new() -> Self {
        // ... (existing config logic)
        // Read learning rate from env, defaulting to 0.01
        let learning_rate = std::env::var("LEARNING_RATE")
            .unwrap_or_else(|_| "0.01".to_string())
            .parse()
            .unwrap_or(0.01);
        // .....
        TrainConfig {
            hidden_size,
            batch_size,
            num_epochs,
            scaling_method,
            learning_rate,
            run_name,
        }
    }
}
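The environment lookup above silently accepts any value that parses as a float, including zero or a negative number. As a possible refinement, the sketch below separates parsing from the environment access and rejects values that cannot work as a step size. `parse_learning_rate` and `learning_rate_from_env` are hypothetical helpers, not part of the code in this series, and the validation rule is an assumption about what counts as a sensible learning rate.

```rust
/// Hypothetical helper: turns the raw `LEARNING_RATE` value into a usable f32.
/// Falls back to `default` when the value is missing, unparsable, or not a
/// finite positive number.
fn parse_learning_rate(raw: Option<&str>, default: f32) -> f32 {
    raw.and_then(|s| s.trim().parse::<f32>().ok())
        .filter(|lr| lr.is_finite() && *lr > 0.0)
        .unwrap_or(default)
}

/// Reads the variable from the process environment, defaulting to 0.01.
fn learning_rate_from_env() -> f32 {
    parse_learning_rate(std::env::var("LEARNING_RATE").ok().as_deref(), 0.01)
}
```

Keeping the parsing logic in a function that takes `Option<&str>` makes it easy to test without touching real environment variables.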
Updating the model struct
Next, we add the optimizer to our SimpleRegressionModel. By storing it in the struct, the model gains the ability to modify its own weights during the training process.
Update your model definition and initialization:
pub struct SimpleRegressionModel<B: Backend> {
    // ... (existing fields)
    optimizer: SGD, // New
}

impl<B: Backend> SimpleRegressionModel<B> {
    pub fn init(device: &B::Device) -> Self {
        // ....
        let optimizer = SGD {};
        Self {
            train_config,
            d_input,
            d_output,
            input_layer,
            output_layer,
            device: device.clone(),
            optimizer,
        }
    }
}
Implementing the Backward Pass
The Backward Pass is the actual engine of machine learning. While the forward pass calculates the output, the backward pass works in reverse, using calculus (specifically the Chain Rule) to determine how much each weight and bias contributed to the total error.
1. The Starting Point: Gradient of the Loss
Our first step is to calculate the derivative of our Mean Squared Error (MSE) loss with respect to our predictions (z2). Mathematically, the derivative of (P - A)² is 2(P - A).
// grad_logits is the 'error signal' coming from the loss function
// We multiply by 2/N: the 2 comes from differentiating the square, and 1/N from the mean over the batch
let grad_logits = z2.sub(targets) * (2.0 / self.train_config.batch_size as f32);
debug_tensor("grad_logits", &grad_logits);
Shape: [Batch, 1]
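Written out for a batch of N predictions, with the loss defined as the mean of the squared errors, the gradient for each prediction is:

$$\frac{\partial L}{\partial P_i} = \frac{2}{N}(P_i - A_i)$$

The same formula as a plain-Rust sketch on slices (the function names are hypothetical and no Burn tensors are involved):

```rust
/// Mean squared error over a batch.
fn mse(preds: &[f32], targets: &[f32]) -> f32 {
    let n = preds.len() as f32;
    preds.iter().zip(targets).map(|(p, a)| (p - a).powi(2)).sum::<f32>() / n
}

/// Gradient of the MSE with respect to each prediction: (2 / N) * (P - A).
fn mse_grad(preds: &[f32], targets: &[f32]) -> Vec<f32> {
    let n = preds.len() as f32;
    preds.iter().zip(targets).map(|(p, a)| 2.0 * (p - a) / n).collect()
}
```

Here N is taken from the actual number of predictions. Using `train_config.batch_size` instead gives the same result as long as every batch contains exactly that many rows.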
2. Backpropagating through Output Layer (Layer 2)
To update the weights of the output layer, we calculate how the error changes relative to the weights and the inputs that entered that layer (a1).
// Gradient for weights: Inputs^T * grad_output
let grad_weight_2 = a1.transpose().matmul(grad_logits.clone());
// Gradient for bias: Sum of the errors across the batch
let grad_bias_2 = grad_logits.clone().sum_dim(0);
Shapes:
grad_weight_2: [64, 1] (matching the output weight shape)
grad_bias_2: [1, 1]
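To see why the transpose produces a gradient with the same shape as the weights, the two formulas can be spelled out with explicit loops. This is a plain-Rust sketch using nested vectors as matrices; the `Matrix` alias and function names are hypothetical, and it assumes a non-empty batch.

```rust
type Matrix = Vec<Vec<f32>>; // row-major: matrix[row][col]

/// grad_weight = a1^T * grad_out, where a1 is [batch, hidden] and
/// grad_out is [batch, out]. The result is [hidden, out], the same
/// shape as the weight matrix. Assumes at least one row in each input.
fn grad_weight(a1: &Matrix, grad_out: &Matrix) -> Matrix {
    let (batch, hidden, out) = (a1.len(), a1[0].len(), grad_out[0].len());
    let mut gw: Matrix = vec![vec![0.0; out]; hidden];
    for b in 0..batch {
        for h in 0..hidden {
            for o in 0..out {
                // Each sample adds (activation * error) to the matching weight.
                gw[h][o] += a1[b][h] * grad_out[b][o];
            }
        }
    }
    gw
}

/// grad_bias sums grad_out over the batch dimension: one value per output unit.
fn grad_bias(grad_out: &Matrix) -> Vec<f32> {
    let out = grad_out[0].len();
    (0..out)
        .map(|o| grad_out.iter().map(|row| row[o]).sum::<f32>())
        .collect()
}
```

With a batch of 2, a hidden size of 2 and one output, `grad_weight` returns a 2 x 1 matrix, mirroring how [Batch, 64] activations and [Batch, 1] errors yield a [64, 1] gradient.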
3. Passing the Error through the Activation (ReLU)
Before we can calculate gradients for Layer 1, we must pass the error signal through the ReLU activation. If a neuron was "off" (output was zero) during the forward pass, the gradient becomes zero—it can't learn anything from this iteration.
// 1. Calculate how much the hidden layer values (a1) contributed to the final error
let grad_hidden = grad_logits.matmul(weight_2.transpose());
// 2. Apply the ReLU derivative (the gatekeeper)
let grad_hidden_relu = ReLU::backward(grad_hidden, z1.clone());
Shape: [Batch, 64]
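The body of `ReLU::backward` is not shown in this part, so the following is only an assumption about what such a step typically does, written in plain Rust on slices: it lets the incoming gradient through wherever the pre-activation was positive and blocks it everywhere else. Treating the derivative at exactly zero as 0 is a common convention, not something the original code confirms.

```rust
/// Assumed behaviour of a ReLU backward step: pass the incoming gradient
/// through where the pre-activation z was positive, block it elsewhere.
fn relu_backward(grad: &[f32], z: &[f32]) -> Vec<f32> {
    grad.iter()
        .zip(z)
        .map(|(g, z)| if *z > 0.0 { *g } else { 0.0 })
        .collect()
}
```

For example, with gradients `[1.5, -2.0, 3.0]` and pre-activations `[0.7, -0.1, 0.0]`, only the first gradient survives the gate.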
4. Backpropagating through Input Layer (Layer 1)
Finally, we arrive at the first layer. We use the error signal that survived the ReLU "gate" to calculate the gradients for our input weights and biases.
// Gradient for weights: Inputs^T * grad_from_relu
let grad_weight_1 = inputs.clone().transpose().matmul(grad_hidden_relu.clone());
// Gradient for bias: Sum across the batch
let grad_bias_1 = grad_hidden_relu.clone().sum_dim(0);
Shapes:
grad_weight_1: [1, 64]
grad_bias_1: [1, 64]
5. The Optimizer Step: Updating the Model
Now that we have the gradients (the "directions"), we tell our Optimizer to take a step. We apply the learning rate to these gradients and adjust the actual weights in the model.
// Example: Updating the Input Layer Weight
self.optimizer.step(
    self.train_config.learning_rate,
    &mut self.input_layer.weight,
    grad_weight_1,
);
// Note: For biases, we squeeze(0) because sum_dim(0) leaves an extra dimension [1, 64]
// and the bias parameter expects [64].
self.optimizer.step(
    self.train_config.learning_rate,
    &mut self.input_layer.bias,
    grad_bias_1.squeeze(0),
);
6. Full Code of Learning Process
pub fn do_train(&mut self, input_target_tensors: Option<Vec<(Tensor<B, 2>, Tensor<B, 1>)>>) {
    create_artifact_dir(&self.train_config.run_name);
    let input_target_tensors = input_target_tensors.unwrap();
    let mut iteration: usize = 1;
    let num_epochs = self.train_config.num_epochs;
    for epoch in 0..num_epochs {
        println!(
            "================= Epoch {}/{} =================",
            epoch + 1,
            num_epochs
        );
        for (inputs, targets) in input_target_tensors.iter() {
            println!("---------{iteration}th iteration start---------");
            println!("---------forward pass---------");
            let weight_1 = self.input_layer.weight.val().unsqueeze();
            let bias_1 = self.input_layer.bias.val().unsqueeze();
            let z1 = inputs.clone().matmul(weight_1) + bias_1;
            debug_tensor("Layer 1 pre-activation (z1)", &z1);
            let a1 = ReLU::forward(z1.clone());
            debug_tensor("Layer 1 activation (a1)", &a1);
            let weight_2 = self.output_layer.weight.val().unsqueeze();
            let bias_2 = self.output_layer.bias.val().unsqueeze();
            let z2 = a1.clone().matmul(weight_2.clone()) + bias_2;
            debug_tensor("Layer 2 pre-activation (z2)", &z2);
            let targets: Tensor<B, 2> = targets.clone().unsqueeze_dim(1);
            let loss = self.compute_loss(z2.clone(), targets.clone());
            debug_tensor("Loss", &loss);
            println!("---------backward pass---------");
            let grad_logits = z2.sub(targets) * (2.0 / self.train_config.batch_size as f32);
            debug_tensor("grad_logits", &grad_logits);
            let grad_weight_2 = a1.transpose().matmul(grad_logits.clone());
            let grad_bias_2 = grad_logits.clone().sum_dim(0);
            debug_tensor("grad_weight_2", &grad_weight_2);
            debug_tensor("grad_bias_2", &grad_bias_2);
            let grad_hidden = grad_logits.matmul(weight_2.transpose());
            let grad_hidden_relu = ReLU::backward(grad_hidden, z1.clone());
            let grad_weight_1 = inputs.clone().transpose().matmul(grad_hidden_relu.clone());
            let grad_bias_1 = grad_hidden_relu.clone().sum_dim(0);
            debug_tensor("grad_weight_1", &grad_weight_1);
            debug_tensor("grad_bias_1", &grad_bias_1);
            // The array is only read from, so it does not need to be mutable
            let grads = [
                grad_weight_1.clone(),
                grad_bias_1.clone(),
                grad_weight_2.clone(),
                grad_bias_2.clone(),
            ];
            println!("---------updating weights and biases---------");
            self.optimizer.step(
                self.train_config.learning_rate,
                &mut self.input_layer.weight,
                grads[0].clone(),
            );
            // sum_dim keeps the reduced dimension with size 1, so we squeeze the first dimension
            self.optimizer.step(
                self.train_config.learning_rate,
                &mut self.input_layer.bias,
                grads[1].clone().squeeze(0),
            );
            self.optimizer.step(
                self.train_config.learning_rate,
                &mut self.output_layer.weight,
                grads[2].clone(),
            );
            // sum_dim keeps the reduced dimension with size 1, so we squeeze the first dimension
            self.optimizer.step(
                self.train_config.learning_rate,
                &mut self.output_layer.bias,
                grads[3].clone().squeeze(0),
            );
            println!("---------{iteration}th iteration end---------");
            iteration += 1;
        }
    }
}
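A hand-written backward pass is easy to get subtly wrong, and one common way to gain confidence in it is a numerical gradient check: nudge one weight, measure how the loss changes, and compare with the analytic gradient. The sketch below does this for a tiny 1 → H → 1 network in plain Rust. It mirrors the structure of the training code above but does not use Burn, and `Net`, `loss`, `backward` and `numeric_grad_w1` are hypothetical names introduced only for this check.

```rust
/// Parameters of a tiny 1 -> H -> 1 network, stored as plain vectors.
#[derive(Clone)]
struct Net {
    w1: Vec<f32>,
    b1: Vec<f32>,
    w2: Vec<f32>,
    b2: f32,
}

/// Forward pass plus MSE loss over the batch.
fn loss(net: &Net, xs: &[f32], ys: &[f32]) -> f32 {
    let mut total = 0.0_f32;
    for (x, y) in xs.iter().zip(ys) {
        let mut z2 = net.b2;
        for h in 0..net.w1.len() {
            let z1 = x * net.w1[h] + net.b1[h];
            z2 += z1.max(0.0) * net.w2[h]; // ReLU, then output layer
        }
        total += (z2 - y).powi(2);
    }
    total / xs.len() as f32
}

/// Manual backward pass; returns (grad_w1, grad_b1, grad_w2, grad_b2).
fn backward(net: &Net, xs: &[f32], ys: &[f32]) -> (Vec<f32>, Vec<f32>, Vec<f32>, f32) {
    let hidden = net.w1.len();
    let n = xs.len() as f32;
    let mut gw1 = vec![0.0_f32; hidden];
    let mut gb1 = vec![0.0_f32; hidden];
    let mut gw2 = vec![0.0_f32; hidden];
    let mut gb2 = 0.0_f32;
    for (x, y) in xs.iter().zip(ys) {
        let z1: Vec<f32> = (0..hidden).map(|h| x * net.w1[h] + net.b1[h]).collect();
        let z2 = net.b2 + (0..hidden).map(|h| z1[h].max(0.0) * net.w2[h]).sum::<f32>();
        let grad_logit = 2.0 * (z2 - y) / n; // gradient of MSE w.r.t. this prediction
        gb2 += grad_logit;
        for h in 0..hidden {
            gw2[h] += z1[h].max(0.0) * grad_logit;
            // ReLU gate: the signal only flows where z1 was positive.
            let grad_hidden = if z1[h] > 0.0 { grad_logit * net.w2[h] } else { 0.0 };
            gw1[h] += x * grad_hidden;
            gb1[h] += grad_hidden;
        }
    }
    (gw1, gb1, gw2, gb2)
}

/// Finite-difference estimate of dLoss/dw1[h], for comparison with `backward`.
fn numeric_grad_w1(net: &Net, xs: &[f32], ys: &[f32], h: usize) -> f32 {
    let eps = 1e-2_f32;
    let (mut plus, mut minus) = (net.clone(), net.clone());
    plus.w1[h] += eps;
    minus.w1[h] -= eps;
    (loss(&plus, xs, ys) - loss(&minus, xs, ys)) / (2.0 * eps)
}
```

If the two values for a weight differ by much more than floating-point noise, the backward pass has a bug. The check is unreliable when a pre-activation sits almost exactly at zero, because the nudge can flip a ReLU on or off.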
Summary
The loop is now complete. For every batch of data:
Forward: x → z1 → a1 → z2 → Loss
Backward: Loss → grad_logits → grad_hidden → grad_hidden_relu → weight and bias gradients
Update: Weights are adjusted slightly to make the Loss smaller next time.
Running the Code
We have built the layers, implemented the activation functions, calculated the loss, and wired up the backward pass with an optimizer. Now, it’s time to ignite the engine and watch our manual neural network learn to approximate our cubic function.
Analyzing the First Iteration
---------forward pass---------
Layer 1 pre-activation (z1) shape=[10, 64], min=-0.081, max=2.610
Layer 1 activation (a1) shape=[10, 64], min=0.000, max=2.610
Layer 2 pre-activation (z2) shape=[10, 1], min=1.000, max=168.025
Loss shape=[1], min=13768.413, max=13768.413
---------backward pass---------
grad_logits shape=[10, 1], min=0.336, max=33.206
grad_weight_2 shape=[64, 1], min=429.973, max=429.973
grad_bias_2 shape=[1, 1], min=206.768, max=206.768
grad_weight_1 shape=[1, 64], min=223.541, max=223.541
grad_bias_1 shape=[1, 64], min=206.432, max=206.432
---------1th iteration end---------
At the start, the Loss is 13,768. Notice the gradients—they are quite large (e.g., grad_weight_2 is 429.9). This indicates that the model has identified a massive error and is about to make a significant "jump" in its weight values to correct it.
Progress Check: The 200th Iteration
After 200 iterations of feeding data, calculating errors, and updating weights, the transformation is remarkable:
---------200th iteration start---------
---------forward pass---------
Layer 1 pre-activation (z1) shape=[10, 64], min=-0.334, max=1.561
Layer 1 activation (a1) shape=[10, 64], min=0.000, max=1.561
Layer 2 pre-activation (z2) shape=[10, 1], min=0.344, max=6.502
Loss shape=[1], min=9.544, max=9.544
---------backward pass---------
grad_logits shape=[10, 1], min=0.195, max=1.133
grad_weight_2 shape=[64, 1], min=4.454, max=4.454
grad_bias_2 shape=[1, 1], min=5.204, max=5.204
grad_weight_1 shape=[1, 64], min=0.128, max=0.128
grad_bias_1 shape=[1, 64], min=0.267, max=0.267
---------updating weights and biases---------
---------200th iteration end---------
Loss Reduction: The loss dropped from 13,768 to 9.54 (and it will continue to fall as training progresses).
Gradient Stabilization: The gradients have shrunk significantly (from 429.9 down to 4.45). This means the model is no longer making wild guesses; it is now "fine-tuning" its understanding of the cubic curve.
Conclusion
We have successfully built the complete training process. By running the full pipeline, we watched the loss collapse and the gradients shrink and stabilize. Our custom neural network is genuinely learning the patterns in our data.
Here is a quick look at the core pillars we’ve established so far:
The Architecture: We defined a hidden layer and an output layer from scratch.
The Non-Linearity: We implemented the ReLU activation to allow our model to understand curves, not just straight lines.
Stability: We learned why Standardization is vital when dealing with polynomial data like x³.
The Core Mechanics: We manually calculated the Backward Pass using the chain rule and updated our model using Stochastic Gradient Descent.
In the next part, we will look into how we can save our model to files so that we can load and reuse it.