Part 2: Generating Training Data
Now that we've seen tensors, it's time to start building the deep learning process. The first thing we need is a dataset for training. Instead of using a pre-made dataset, we are going to build our own from scratch. Why? Because with synthetic data we know the exact function behind every point, which makes it easy to see how well our model learns a specific mathematical relationship.
We’ll be creating a dataset based on a cubic function for a simple regression model:
$$y = 0.01x^3 - 0.5x + 20 + \text{noise}$$
We will generate pairs of (x, y) values from the function above. Each sample will include a small amount of random noise, because real-world datasets are never perfect and the noise simulates that.
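To get a feel for the curve before any noise is added, here is a minimal sketch that evaluates only the deterministic part of the formula. The names cubic and evaluate are hypothetical helpers used just for this illustration; they are not part of the project code.

```rust
/// Noise-free part of the target function: y = 0.01x^3 - 0.5x + 20.
/// `cubic` is a hypothetical helper name used only for this illustration.
fn cubic(x: f32) -> f32 {
    0.01 * x.powi(3) - 0.5 * x + 20.0
}

/// Evaluates the noise-free curve at each given x and returns (x, y) pairs.
fn evaluate(xs: &[f32]) -> Vec<(f32, f32)> {
    xs.iter().map(|&x| (x, cubic(x))).collect()
}
```

At x = 0 only the constant term survives, so cubic(0.0) is 20. Calling evaluate(&[-100.0, 0.0, 100.0]) shows how quickly the cubic term takes over as x moves away from zero.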
Project Setup
We’ll continue from where we left off in Part 1.
Create a new file called data_generator.rs.
In your main.rs, add the following line at the top: mod data_generator;
Inside data_generator.rs, define two functions:
One for generating data.
Another for saving the data to a CSV file.
// data_generator.rs
use std::error::Error;
// Stubs for now; todo!() lets the file compile until the bodies are filled in.
pub fn generate_and_save_data(num_points: usize) -> Result<(), Box<dyn Error>> { todo!() }
fn save_data_to_csv(data: &Vec<(f32, f32)>, filename: &str) -> Result<(), Box<dyn Error>> { todo!() }
Next, add the required dependencies to Cargo.toml.
We’ll use the rand crate for random number generation and the rand_distr crate for Gaussian noise.
[package]
name = "simple-regression"
version = "0.1.0"
edition = "2021"
[dependencies]
burn = { version = "0.17.1", features = ["ndarray"] }
rand = "0.9.1"
rand_distr = "0.5.1"
Generate Data
Let’s start with the generate function.
Step 1. Define the noise distribution
let noise_std_dev = 10.0;
let normal_dist = Normal::new(0.0, noise_std_dev)?;
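We use a normal (Gaussian) distribution centered at 0 with a standard deviation of 10.
The noise is therefore equally likely to push y up or down, and about 95% of the samples land within ±20 of the true curve. The ? propagates the error that Normal::new returns if the standard deviation is invalid.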
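If you want to check that generated noise really matches these parameters, one option is to collect the sampled values and compute their mean and standard deviation. The sketch below assumes the noise values have already been gathered into a slice; mean_and_std is a hypothetical helper, not something the tutorial code defines.

```rust
/// Returns the mean and the population standard deviation (dividing by n)
/// of `samples`, or None when the slice is empty.
/// `mean_and_std` is a hypothetical helper for sanity-checking noise values.
fn mean_and_std(samples: &[f32]) -> Option<(f32, f32)> {
    if samples.is_empty() {
        return None;
    }
    let n = samples.len() as f32;
    let mean = samples.iter().sum::<f32>() / n;
    // Average squared distance from the mean.
    let variance = samples.iter().map(|s| (s - mean).powi(2)).sum::<f32>() / n;
    Some((mean, variance.sqrt()))
}
```

For a large number of samples drawn from the distribution above, the two results should come out close to 0 and 10. They will not match exactly, since each run draws different values.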
Step 2. Set the x-range and generate the data points
let mut rng = rand::rng();
let x_range_min = -100.0;
let x_range_max = 100.0;
let mut data: Vec<(f32, f32)> = Vec::new();
for _ in 0..num_points {
    let x_f: f32 = rng.random_range(x_range_min..x_range_max);
    let noise = normal_dist.sample(&mut rng);
    let y = 0.01 * x_f.powi(3) - 0.5 * x_f + 20.0 + noise;
    data.push((x_f, y));
}
For each data point:
Pick a random x within the range.
Compute the corresponding y using the formula.
Add Gaussian noise to simulate real-world imperfections.
Store the pair (x, y) in a vector.
Save data to CSV
The helper function writes a header row followed by all (x, y) pairs to the file named by filename, which is data.csv in our case.
fn save_data_to_csv(data: &Vec<(f32, f32)>, filename: &str) -> Result<(), Box<dyn Error>> {
    let file = File::create(filename)?;
    let mut writer = BufWriter::new(file);
    writeln!(writer, "x,y")?;
    for (x, y) in data {
        writeln!(writer, "{},{}", x, y)?;
    }
    println!("Data saved to {}", filename);
    Ok(())
}
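As a design alternative, the CSV logic can be written against any Write target instead of a file path, which makes it easy to check the output format in memory. This is only a sketch: write_csv is a hypothetical variant, it takes a slice rather than &Vec, and it flushes explicitly, none of which the tutorial code above does.

```rust
use std::io::{self, Write};

/// Writes an "x,y" header and one row per pair to any `Write` target.
/// `write_csv` is a hypothetical variant of the helper above.
fn write_csv<W: Write>(writer: &mut W, data: &[(f32, f32)]) -> io::Result<()> {
    writeln!(writer, "x,y")?;
    for (x, y) in data {
        writeln!(writer, "{},{}", x, y)?;
    }
    // Flush explicitly so a buffered writer reports write errors here
    // instead of ignoring them when it is dropped.
    writer.flush()
}
```

Passing a Vec<u8> as the writer captures the output in memory, while passing a BufWriter wrapped around a File gives the same behavior as the file-based helper.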
Complete Code
use rand::Rng;
use rand_distr::{Distribution, Normal};
use std::error::Error;
use std::fs::File;
use std::io::{BufWriter, Write};

pub fn generate_and_save_data(num_points: usize) -> Result<(), Box<dyn Error>> {
    let noise_std_dev = 10.0;
    let normal_dist = Normal::new(0.0, noise_std_dev)?;
    let mut rng = rand::rng();
    let x_range_min = -100.0;
    let x_range_max = 100.0;
    let mut data: Vec<(f32, f32)> = Vec::new();
    for _ in 0..num_points {
        let x_f: f32 = rng.random_range(x_range_min..x_range_max);
        let noise = normal_dist.sample(&mut rng);
        let y = 0.01 * x_f.powi(3) - 0.5 * x_f + 20.0 + noise;
        data.push((x_f, y));
    }
    save_data_to_csv(&data, "data.csv")?;
    Ok(())
}

fn save_data_to_csv(data: &Vec<(f32, f32)>, filename: &str) -> Result<(), Box<dyn Error>> {
    let file = File::create(filename)?;
    let mut writer = BufWriter::new(file);
    // Optional: write header
    writeln!(writer, "x,y")?;
    for (x, y) in data {
        writeln!(writer, "{},{}", x, y)?;
    }
    println!("Data saved to {}", filename);
    Ok(())
}
Run the code
Now, let’s put our generator to work. We’ll call the function in main.rs to produce our training data.
// main.rs
mod data_generator; // declares the module so the import below resolves

use crate::data_generator::generate_and_save_data;

fn main() {
    generate_and_save_data(100000).expect("Failed to generate and save data.");
}
Once you run this, you will find a data.csv file in the directory you ran the program from (the project root if you run cargo run there). The values will look something like this:
x,y
-78.34,-4712.52
12.45,3.86
98.21,9340.12
(Note: These rows are only illustrative. Your exact numbers will differ because both x and the noise are drawn at random.)
We can also plot this data. Visualizing it helps us understand the training dataset, and later we can compare the plot with the model's predictions.
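Before reaching for a plotting tool, a quick numeric check of the file can already tell you whether the data looks reasonable. The sketch below parses CSV text in the x,y layout written above and reports the range of y; parse_xy_csv and y_range are hypothetical helper names, and skipping malformed rows is an assumption made for this example.

```rust
/// Parses CSV text with an "x,y" header into (x, y) pairs.
/// Rows that do not parse as two floats are skipped.
fn parse_xy_csv(text: &str) -> Vec<(f32, f32)> {
    text.lines()
        .skip(1) // skip the "x,y" header row
        .filter_map(|line| {
            let (x, y) = line.split_once(',')?;
            let x = x.trim().parse::<f32>().ok()?;
            let y = y.trim().parse::<f32>().ok()?;
            Some((x, y))
        })
        .collect()
}

/// Returns (min, max) of the y values, or None when there are no rows.
fn y_range(data: &[(f32, f32)]) -> Option<(f32, f32)> {
    let mut ys = data.iter().map(|&(_, y)| y);
    let first = ys.next()?;
    Some(ys.fold((first, first), |(lo, hi), y| (lo.min(y), hi.max(y))))
}
```

To use it on the generated file, read the contents with std::fs::read_to_string("data.csv") and pass the resulting string to parse_xy_csv. With x between -100 and 100, the noise-free curve runs from about -9930 to 9970, so a y range far outside that would point to a problem.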
Conclusion
We now have our very own dataset.
In Part 3, we will take these CSV values, load them back into tensors, and build our first regression model using the Burn framework.