Part 2: Generating Training Data


Now that we've seen tensors, it's time to start building the deep learning process. The first thing we need is a dataset for training. Instead of using a pre-made dataset, we are going to build our own from scratch. Why? Because with synthetic data we know the exact function behind it, so we can see precisely how well our model learns a specific mathematical relationship.

We’ll be creating a dataset based on a cubic function for a simple regression model:

$$y = 0.01x^3 - 0.5x + 20 + \text{noise}$$

We will generate pairs of (x, y) values from the function above. Each sample includes a small amount of random noise, because real-world datasets are never perfectly clean and the noise simulates that imperfection.
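To get a feel for the numbers involved, here is a minimal sketch of the noise-free part of the formula. The name `true_curve` is a hypothetical helper used only for illustration; it is not part of the project code.

```rust
/// Noise-free part of the target function: y = 0.01x^3 - 0.5x + 20.
/// `true_curve` is a hypothetical helper name used only for illustration.
pub fn true_curve(x: f32) -> f32 {
    0.01 * x.powi(3) - 0.5 * x + 20.0
}
```

Working it by hand: at x = 0 the curve sits at 20, at x = 10 it gives 10 - 5 + 20 = 25, and at x = 100 it gives 10000 - 50 + 20 = 9970. The cubic term dominates quickly as x moves away from zero.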

Project Setup

We’ll continue from where we left off in Part 1.

  1. Create a new file called data_generator.rs.

  2. In your main.rs, add the following line at the top:

    mod data_generator;
    
  3. Inside data_generator.rs, define two functions:

    • One for generating data.

    • Another for saving the data to a CSV file.

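// data_generator.rs
use std::error::Error;

// Generates the noisy (x, y) pairs and saves them to disk.
pub fn generate_and_save_data(num_points: usize) -> Result<(), Box<dyn Error>> {
    todo!() // placeholder so the stub compiles; implemented later in this part
}

// Writes the pairs to a CSV file.
fn save_data_to_csv(data: &Vec<(f32, f32)>, filename: &str) -> Result<(), Box<dyn Error>> {
    todo!() // placeholder so the stub compiles; implemented later in this part
}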

Next, add the required dependencies to Cargo.toml.
We’ll use the rand crate for random number generation and the rand_distr crate for Gaussian noise.

[package]
name = "simple-regression"
version = "0.1.0"
edition = "2021"

[dependencies]
burn = { version = "0.17.1", features = ["ndarray"] }
rand = "0.9.1"
rand_distr = "0.5.1"

Generate Data

Let’s start with the generate function.

Step 1. Define the noise distribution

let noise_std_dev = 10.0;
let normal_dist = Normal::new(0.0, noise_std_dev)?;

We use a normal (Gaussian) distribution centered at 0 with a standard deviation of 10.
Because the mean is 0, the noise is equally likely to push y up or down, and large deviations are rare.
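In symbols, the noise model can be restated as follows. This is only a rewriting of the setup above using standard facts about the Gaussian distribution, not an extra step the code performs:

$$\text{noise} \sim \mathcal{N}(0,\ 10^2), \qquad y \mid x \sim \mathcal{N}\left(0.01x^3 - 0.5x + 20,\ 10^2\right)$$

By the usual rule of thumb for a normal distribution, about 68% of samples land within one standard deviation (±10) of the curve and about 95% within two (±20).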

Step 2. Set the x-range and generate the data points

let mut rng = rand::rng();

let x_range_min = -100.0;
let x_range_max = 100.0;

let mut data: Vec<(f32, f32)> = Vec::new();

for _ in 0..num_points {
    let x_f: f32 = rng.random_range(x_range_min..x_range_max);
    let noise = normal_dist.sample(&mut rng);
    let y = 0.01 * x_f.powi(3) - 0.5 * x_f + 20.0 + noise;

    data.push((x_f, y));
}

For each data point:

  • Pick a random x within the range.

  • Compute the corresponding y using the formula.

  • Add Gaussian noise to simulate real-world imperfections.

  • Store the pair (x, y) in a vector.
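One way to sanity-check the steps above is to subtract the known curve from each y and look at what is left. The sketch below assumes the same formula as the generator; the name `residual_stats` is hypothetical and not part of the project code.

```rust
/// Mean and (population) standard deviation of the residuals y - f(x),
/// where f is the noise-free cubic from the formula.
/// `residual_stats` is a hypothetical helper for illustration only.
pub fn residual_stats(data: &[(f32, f32)]) -> Option<(f64, f64)> {
    if data.is_empty() {
        return None;
    }
    let n = data.len() as f64;
    // Work in f64 to limit rounding error when summing many points.
    let residuals: Vec<f64> = data
        .iter()
        .map(|&(x, y)| {
            let x = x as f64;
            y as f64 - (0.01 * x.powi(3) - 0.5 * x + 20.0)
        })
        .collect();
    let mean = residuals.iter().sum::<f64>() / n;
    let variance = residuals.iter().map(|r| (r - mean).powi(2)).sum::<f64>() / n;
    Some((mean, variance.sqrt()))
}
```

On a large generated dataset we would expect the mean to come out close to 0 and the standard deviation close to 10. A large departure from either would suggest a mistake in the generator.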

Save data to CSV

The helper function writes all (x, y) pairs, one per line under an x,y header, to the file named by filename. Our generate function passes data.csv.

fn save_data_to_csv(data: &Vec<(f32, f32)>, filename: &str) -> Result<(), Box<dyn Error>> {
    let file = File::create(filename)?;
    let mut writer = BufWriter::new(file);

    writeln!(writer, "x,y")?;
    for (x, y) in data {
        writeln!(writer, "{},{}", x, y)?;
    }

    // Flush explicitly so a failed write surfaces as an error here.
    writer.flush()?;

    println!("Data saved to {}", filename);
    Ok(())
}
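A possible variation, offered as a sketch rather than the approach this series takes: if the formatting logic accepts any `std::io::Write`, it can be exercised against an in-memory buffer without touching the filesystem. The name `write_csv` is hypothetical.

```rust
use std::io::{self, Write};

/// Writes a header row followed by one "x,y" line per pair.
/// `write_csv` is a hypothetical helper; it accepts any writer
/// (a BufWriter<File>, a Vec<u8>, stdout, ...).
pub fn write_csv<W: Write>(mut writer: W, data: &[(f32, f32)]) -> io::Result<()> {
    writeln!(writer, "x,y")?;
    for (x, y) in data {
        writeln!(writer, "{},{}", x, y)?;
    }
    // Flush explicitly: a buffered writer that is merely dropped
    // ignores any error from its final write.
    writer.flush()
}
```

With this shape, the file-based version reduces to creating the file, wrapping it in a `BufWriter`, and passing it in. Note that Rust prints whole-number floats without a trailing `.0`, so a pair like (1.5, -2.0) is written as `1.5,-2`.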

Complete Code

use rand::Rng;
use rand_distr::{Distribution, Normal};
use std::error::Error;
use std::fs::File;
use std::io::{BufWriter, Write};

pub fn generate_and_save_data(num_points: usize) -> Result<(), Box<dyn Error>> {
    let noise_std_dev = 10.0;

    let normal_dist = Normal::new(0.0, noise_std_dev)?;
    let mut rng = rand::rng();

    let x_range_min = -100.0;
    let x_range_max = 100.0;

    let mut data: Vec<(f32, f32)> = Vec::new();

    for _ in 0..num_points {
        let x_f: f32 = rng.random_range(x_range_min..x_range_max);
        let noise = normal_dist.sample(&mut rng);
        let y = 0.01 * x_f.powi(3) - 0.5 * x_f + 20.0 + noise;

        data.push((x_f, y));
    }

    save_data_to_csv(&data, "data.csv")?;
    Ok(())
}

fn save_data_to_csv(data: &Vec<(f32, f32)>, filename: &str) -> Result<(), Box<dyn Error>> {
    let file = File::create(filename)?;
    let mut writer = BufWriter::new(file);

    // Optional: write header
    writeln!(writer, "x,y")?;

    for (x, y) in data {
        writeln!(writer, "{},{}", x, y)?;
    }

    // Flush explicitly so a failed write surfaces as an error here.
    writer.flush()?;

    println!("Data saved to {}", filename);
    Ok(())
}

Run the code

Now let’s put our generator to work. We’ll call the function from main.rs to produce our training data.

// main.rs
mod data_generator;

use crate::data_generator::generate_and_save_data;

fn main() {
    // Generate 100,000 noisy samples and write them to data.csv.
    generate_and_save_data(100_000).expect("Failed to generate and save data.");
}

Once you run this with cargo run, you will find a data.csv file in the directory you ran the command from (normally the project root). The values will look something like this:

x,y
-78.34,-4712.52
12.45,3.86
98.21,9340.12

(Note: Your exact numbers will differ, because both x and the noise are drawn at random on every run.)
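For a quick check that the file has the expected shape, the text can be parsed back into pairs. This is a minimal sketch using only the standard library; `parse_csv` is a hypothetical name, and it assumes the two-column layout with an x,y header shown above.

```rust
/// Parses CSV text with an "x,y" header into (x, y) pairs.
/// `parse_csv` is a hypothetical helper for a quick sanity check;
/// on failure it returns a message naming the first malformed line.
pub fn parse_csv(text: &str) -> Result<Vec<(f32, f32)>, String> {
    let mut rows = Vec::new();
    // enumerate() before skip(1) keeps idx aligned with file lines,
    // so idx + 1 is the 1-based line number; skip(1) drops the header.
    for (idx, line) in text.lines().enumerate().skip(1) {
        if line.trim().is_empty() {
            continue; // tolerate blank lines, e.g. a trailing newline
        }
        let (x, y) = line
            .split_once(',')
            .ok_or_else(|| format!("line {}: expected two columns", idx + 1))?;
        let x: f32 = x.trim().parse().map_err(|e| format!("line {}: {}", idx + 1, e))?;
        let y: f32 = y.trim().parse().map_err(|e| format!("line {}: {}", idx + 1, e))?;
        rows.push((x, y));
    }
    Ok(rows)
}
```

Feeding it the contents of data.csv (for example via `std::fs::read_to_string`) should yield as many rows as the num_points value passed to the generator.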

It is worth plotting this data. A scatter plot shows the shape of the training set at a glance, and later we can compare it with the model's predictions to see how closely they match.

Conclusion

We now have our very own dataset.

In Part 3, we will take these CSV values, load them back into tensors, and build our first regression model using the Burn framework.

Understanding Deep Learning by Building It in Rust

Part 3 of 8

Learn deep learning by building it from scratch in Rust using Burn only for tensors. We’ll implement activations, losses, backprop, and optimizers step by step to understand how neural networks truly work.
