Synthetic Data Pipeline for python - Chapter C.1.1
Teaching Python capabilities to a model during pretraining using synthetic dataset.
Series: Building Synthetic Data Pipelines
Stage: Pretraining
Domain: Code
Tier: 1 (Foundational)
This is the first chapter of a series on building synthetic data pipelines. Each post is organized by training stage (Pretraining, SFT, Mid-training, RL) and difficulty tier (Tier 1 beginner, Tier 2 intermediate, Tier 3 expert).
PS: I hope I continue to write after this.
Thanks for reading SparseDense! Subscribe for free to receive new posts and support my work.
Today: teaching Python capabilities to a model during pretraining.
Having worked at a top Indian AI company, training foundational models I have realised “There is nothing right or nothing wrong in synthetic data curation”. Anything can work or fail since creating synthetic data is usually a work of imagination. While I can’t directly share the specific methodologies we use, I can still teach some synthetic data engineering pipelines for education purposes. I have been working on synthetic data from the past 4 years and the techniques mentioned here are those that are most likely to work [saying from my experience]. However I don’t have the compute to test all of them myself.
Introduction
What are we building and why?
Today we are building a synthetic pretraining dataset. Specifically python programming problems and solutions designed to make a base model better at writing python code.
The inspiration comes from Section 2.3.2 of technical report of NVIDIA’s Nemotron 3 Super . They built a dataset called Code Concepts which has 15 million Python problem solution pairs generated syntheticaly. When they mixed 10B tokens of this data into the final 100B tokens of Nemotron Nano v3 pretraining HumanEval accuracy jumped 6 points (73 → 79).
In this blog/chapter, I will walk through exactly what they did, how they did it and where I think they left room for improvement. This is not a NVIDIA critique writeup, it treats their technical report as a base case study to understand what we can learn from it.
But first why target HumanEval? Isn’t it saturated?
Fair question. HumanEval is 164 hand-written Python problems from 2021. Frontier models score 90%+ on it. So why bother?
The field has moved through three eras:
Can you write a function? -> HumanEval, MBPP (2021–2023)
Can you solve hard problems? —> LiveCodeBench, CRUXEval (2024)
Can you be a software engineer? —> SWE-bench, Terminal Bench (2024–now)
Each era stacks on the previous one. A model that scores well on SWE-bench but poorly on HumanEval would be suspicious. The simpler benchmarks still serve as sanity checks and for smaller models they are far from saturated.
More importantly my agenda here is not to build a dataset for HumanEval. Its just taking inspiration from it. HumanEval tells us what foundational Python skills look like: standalone functions, clean docstrings, edge case handling, basic algorithmic thinking. If a model can’t do this, it can’t do anything harder. This is Tier1 the foundation before moving to advanced concepts.
“Benchmarks as inspiration not as targets”
Little bit on HumanEval: HumanEval is 164 problems, all standalone pure Python functions. No imports beyond typing, no external libraries, no file I/O, no classes. Just “here’s a docstring, write the function body.”
But if goal is to learn python then why not pandas, numpy, sklearn?
If we are teaching Python, why not cover libraries? Four reasons:
Its pretraining, not SFT. Pretraining builds foundational code reasoning: control flow, data manipulation, edge case handling, algorithmic thinking. Library specific API knowledge is better taught during supervised fine tuning.
Verification becomes 10x harder. For pure algorithmic problems you can verify correctness with ast.parse. For pandas code you need actual dataframes. For matplotlib you need to render plots.
Library APIs change, algorithms dont. dp.knapsack is timeless. pd.DataFrame.groupby() might break when pandas 3.0 changes the API. Pretraining data that teaches stale APIs actively hurts the model.
Libraries are already in pretraining. GitHub is full of pandas code and scikit learn tutorials. The pretraining corpus already has billions of tokens of organic library usage. What it lacked was structured, concept driven algorithmic problem solving, the exact gap this synthetic dataset fills.
Stage 1: Build a Concept Taxonomy
Purpose
The main purpose of this stage is to build a controlled taxonomy of python programming concepts that will seed all the downstream data generation. The taxonomy acts as the diversity engine for the entire pipeline.
So in easy words: Given a python problem, what are the concepts that problem is testing??
The approach has three steps:
build a broad taxonomy from curated sources
normalize it into a consistent hierarchy
filter it through coding benchmarks to keep only what matters for Python problem-solving.
Step 1.1: Extract concepts from curated sources
One straightforward way to build the initial taxonomy is to run an LLM as a topic classifier on Python and DSA textbooks. Honestly, you don’t even have to run on the entire book, only detailed outline and table of contents is enough.
I used 6-7 books to get initial taxonomy. Books like Data Structures and Algorithms using Python, Fluent Python, Python Cookbook, Effective Python and problem tags from LeetCode and Codeforces are a very good starting point.
Just passed the table of contents chapter by chapter and got like 900 raw concepts. These concepts use a hierarchical dot-notation format. Here are some real examples from the dataset:
algorithms.technique.two-pointer
algorithms.technique.sliding-window
analytics.math.gcd-lcm
data-structures.mapping.dictionary
functionality.processing.reshaping
Think of these as atomic concepts that can be combined. A problem tagged algorithms.technique.two-pointer + algorithms.arrays.processing might ask you to find pairs in a sorted array. I hope you got this part!!
Alternate approach
NVIDIA started with extraction of thousands of concepts from their pretraining code corpus and then filtered them through HumanEval. They built the big taxonomy first, then used it as a lens to classify HumanEval.
Think of it like this: they had a dictionary of 3,000 programming concepts. They looked at each HumanEval problem and asked which concepts from our dictionary does this problem test? 91 dictionary entries matched. They used those 91.
Their focus was mainly on “Which programming concepts does HumanEval actually test?”. So after extracting concepts from all 164 HumanEval problems and filtering pretraining code corpus based on concepts, they landed on 91 unique concepts.
PROMPT 1.1 (Concept Extraction):
You are a programming education expert. Analyze this reference material and extract every distinct programming concept, technique, data structure, and pattern it teaches or covers.
Return a taxonomical dot-notation representation of each concept, organized hierarchically by category. Each concept should follow the pattern: category.subcategory.specific-concept
Rules:
- Each concept should be a reusable skill, not specific to this problem
- Use lowercase-with-hyphens within levels, dots between levels
- Include 1-5 concepts, ranked by importance
Examples of concept labels:
- algorithms.technique.two-pointer
- data-structures.mapping.dictionary
- functionality.processing.parsing
- algorithms.technique.bit-manipulation
Return ONLY valid JSON:
{
"concepts": ["category.subcategory.detail", ...]
}
Material:
{content}
Suggested Model: Frontier, Qwen-3.5-27b
Cost: $
Note: For using NVIDIA’s approach you can use the same prompt with minimal modifications where you classify pretraining code samples instead of reference material.
Because Prompt 1.1 is not constrained, you get a richer initial concept space. The mess is intentional. We need to write another prompt to clean this up which is basically merging related concepts and deduping them.
Step 1.2: Normalize into a clean taxonomy
The raw concepts from multiple books and sources will have duplicates, synonyms and inconsistent naming. Binary search might appear as algorithms.searching.binary-search or algorithms.technique.binary-search and data-structures.arrays.binary-search across different sources.
I used the Prompt 1.2 to normalize everything into a consistent 3 level hierarchy.
PROMPT 1.2 (Dedup/Clean Taxonomies):
You are a taxonomy designer.
Below are {n} programming concept tags that were independently extracted from Python textbooks and DSA references. There are many duplicates, synonyms and inconsistent names.
Your job: produce a CLEAN, CANONICAL taxonomy.
Step 1 — Merge synonyms:
- algorithms.pointers.two-pointer and patterns.array.two-pointers → algorithms.technique.two-pointer
- data-structures.hash.hashmap and mapping.dict.dictionary → data-structures.mapping.dictionary
- math.arithmetic.modular and algorithms.math.modular-arithmetic → analytics.math.modular-arithmetic
Step 2 — Remove noise:
- Too vague: algorithms.general.problem-solving, functionality.utility.general
- Too specific to one problem: algorithms.array.leetcode-121-stock-profit
- About syntax, not concepts: basics.control-flow.for-loop, "basics.syntax.if-statement
- About non-programming concerns: practice.interview.preparation
Step 3 — Fix hierarchy depth:
Every tag that survived Steps 1-2 must be exactly 3 levels.
2-level tag → infer and add the missing subcategory:
- algorithms.recursion → algorithms.recursion.base-case
- data-structures.stack → algorithms.technique.stack-usage
4-level tag → flatten by merging the two most related levels:
- algorithms.graph.traversal.bfs → graph.traversal.bfs
- functionality.data.processing.filtering → functionality.processing.filtering
Step 4 — Standardize naming conventions:
- All lowercase
- Hyphens between words within a level: two-pointer not two_pointer or twopointer
- Dots between levels: algorithms.technique.two-pointer
- No trailing hyphens or dots
- Consistent vocabulary:
"dictionary" not "dict" or "hashmap" or "hash-table" (pick one, use everywhere)
"two-pointer" not "two-pointers" or "2-pointer"
- Consistent top-level domains: algorithms, data-structures, functionality, analytics, dp, graph
Output JSON format:
{{
"taxonomy": ["canonical.concept.one", ...],
"merge_log": {{"kept": ["merged_synonym_1", "merged_synonym_2"]}},
"removed": {{"tag": "reason"}},
"category_summary": {{"category_name": count}}
}}
Raw tags:
{tags}
Suggested Model: Frontier
Cost: $
Step 1.3: Filter through coding benchmarks
This is an optional step but its better to keep your concept generations on target. Not every programming concept in a textbook might be useful for our dataset. Django middleware patterns, async/await paradigms and deployment scripts are real skills but they are not testable as standalone function completion problems, which is what we are generating. (Always remembers the constraints we have while taking a decision)
Prompt 1.3 — Benchmark classification (constrained):
You are an expert Python programming analyst.
Given a Python problem and a taxonomy of programming concepts, identify which concepts from the taxonomy are needed to solve this problem.
IMPORTANT: Choose ONLY from the provided taxonomy. Do not invent new concepts. If a problem requires a skill not in the taxonomy, use the closest match.
Taxonomy:
{taxonomy}
Problem:
{problem}
Return ONLY valid JSON:
{
"task_id": "{task_id}",
"concepts": ["concept.from.taxonomy", ...],
"reasoning": "brief justification"
}
Suggested Model: Frontier
Cost: $
Any concept that appears in at least once survives into the final taxonomy. Concepts that exist only in the textbook vocabulary but never match a benchmark problem get dropped they’re valid programming knowledge but not useful for this dataset.
Frankly I will just handpick rather than giving every job to LLM
Are these taxonomies enough??
NVIDIA’s 91 concepts were filtered through one lens: what does HumanEval test? Since HumanEval allows no imports beyond typing, no file I/O and no classes, their taxonomy naturally excluded anything that touches Python’s standard library. No collections.Counter, no heapq, no itertools.
This was the right decision for their goal, they wanted to prove that concept driven synthetic data improves a specific benchmark and they did (+6 points on HumanEval).
But our goal is different. We don’t want to optimize for one benchmark, we are building foundational Python fluency for pretraining. The dataset should teach the model how a competent Python developer actually thinks and writes code.
So I did some manual analyis and built an improved taxonomy. I divide these into tiers:
Tier 1: Critical gaps (17 concepts): Binary search, greedy algorithms, backtracking, DP subtypes, graph specificity (Dijkstra, topological sort, connected components, cycle detection), prefix sum, Python stdlib patterns (collections.Counter, heapq, itertools, functools.cache).
Tier 2: Important additions (17 concepts): Divide and conquer, list/dict comprehension, enumerate, zip, idioms, character frequency, combinatorics, monotonic stack, sliding window with counter, simulation, precomputation, generator patterns, exception handling, basic class design, two-pass technique, json handling.
Tier 3 — Selective cherry picks (14 concepts): Union find, geometry distance, number factorization, anagram patterns, regex, string conversion, array flattening, key value inversion, bisect module, decorators, context managers.
There were some duplicates and zero signal concepts in Nvidia’s taxonomy (Appendix A) so after removing them I was left with 73 concepts. After combining 73 concepts with the ones mentioned above I got: 141 unique, non-overlapping, Python-native concepts across 12 categories. The full taxonomy is in Appendix B.
Stage 2: Concept Combination & Problem Generation
With a taxonomy in hand, we now generate millions of Python problems. The idea: combine concepts into sets of 1–4 and for each set, ask an LLM to generate a problem that tests all of them simultaneously.
NVIDIA’s approach was straightforward, enumerate every possible combination of their 91 concepts and generate 5 problems per combo:
C(91,1) = 91
C(91,2) = 4,095
C(91,3) = 121,485
C(91,4) = 2,672,670
───────────────────────
Total 2,798,341 combinations
Problems 2,798,341 combinations × 5 problems each ≈ 14M problems
They used gpt-oss-20b a smaller, cheaper model for problem generation. Each problem had to include a descriptive function name and a problem description in the function docstring.
Note that the concepts passed into prompts use the dot notation tags from Stage 1 (algorithms.technique.two-pointer, data-structures.mapping.dictionary) not plain English phrases. The model must interpret these hierarchical labels and produce problems that meaningfully exercise the described skills.
The Problem with Blind Enumeration at scale
NVIDIA’s brute force works because they have 91 concepts. But with our expanded 141 concept taxonomy, blind enumeration explodes:
C(141,1) = 141
C(141,2) = 9,870
C(141,3) = 457,310
C(141,4) = 15,777,195
─────────────────────────
Total ~16,244,516 combinations
k=4 alone is ~16 million combinations which is 97% of the total. But heres the main part: k=1, k=2 and k=3 are mostly fine. Two concepts compose naturally ~90% of the time. Even at k=3, most triples are workable the occasional awkward one gets filtered in Stage 4. But here is a problem: most of the combinations when k=4 can be incoherent if done blindly.
The Fix: Affinity Groups
The solution is affinity groups: clusters of concepts that naturally compose together. The 141 concepts divide into 12 groups (arrays, strings, dict/set patterns, algorithmic techniques, dynamic programming, graph algorithms, recursion/backtracking, math/number theory, data processing, Python stdlib, problem-solving strategy, security).
Not every group pair produces coherent problems. Security + graph or recursion + data processing rarely make sense as a single problem. But arrays + techniques is bread and butter i.e. two-pointer on a sorted array, sliding window with prefix sum.
So I have roughly defined 30 high affinity pairs, 15 triples and 8 quads that represent groups most likely to co occur in real coding problems.
The full detailed math for this step is present in Appendix C but here is the summary:
Thats ~8.5% of the full C(141,4) = 15,777,195. We skip 91.5% of the combinatorial space the cross group junk that rarely produces coherent problems. {or it can but I just don’t want to believe now}
The full dataset math
The number of problems per k is bit varied rather than keeping 5 for all.
k=1 (full enum, ×10): 141 combos → 1,410 problems
k=2 (full enum, ×10): 9,870 combos → 98,700 problems
k=3 (full enum, ×5): 457,310 combos → 2,286,550 problems
k=4 (affinity, ×5): 1,348,000 combos → 6,740,000 problems
──────────────────────────────────────────────────────────────────
TOTAL: 1,815,321 combos → 9,126,660 problems
Remember there is no rule that you have to invent a new approach like affinity groups at each stage, its just that if you can do something better than brute force then do it, because there is one thing that will always be in shortage: THE COMPUTE!
NVIDIA’s philosophy was “generate loose, clean strict.” They used a minimal prompt and relied on Stage 4 (import check + AST validation) to enforce quality.
But we have to add constraints upfront i.e. type hints, doctests, edge cases which is a different philosophy: “generate strict, clean stricter.”
Both work. NVIDIA’s approach is simpler but has a higher Stage 4 rejection rate. Mine is more complex but should produce fewer garbage problems.
Prompt 2.1: Generate Problem Statement
You are an expert Python problem designer. Your job is to generate a Python programming problem that tests a given set of concepts.
Requirements:
1. Write only the problem skeleton, no solution code. The skeleton is: any necessary imports, then either a function signature or a class stub with method signatures, plus a docstring.
2. Use descriptive snake_case names for functions, PascalCase for classes.
3. Include type hints for all parameters and return types.
4. The docstring must include:
- A clear problem description
- At least 2 examples in >>> doctest format
- At least one edge case example (empty input, single element, etc.)
5. If the concepts require stdlib modules (heapq, collections, re, json, itertools, functools, bisect, math, dataclasses, contextlib, typing etc) include the necessary imports at the top. Only Python standard library imports are allowed.
6. Most problems should be a single function. But if the concepts involve class design, dataclasses, decorators or context managers, use the appropriate structure:
- class-design / dataclasses → class stub with method signatures and `pass` bodies
- decorators → a decorator function signature + a sample function it should wrap
- context-managers → a class with `__enter__`/`__exit__` stubs or a generator with `@contextmanager`
7. For each listed concept, ask yourself: "would someone need to know this concept to solve this problem?" If the answer is no for any concept, redesign the problem until the answer is yes for all.
Wrap your output in XML tags. No explanation, no markdown fences.
<problem>
...your problem skeleton here...
</problem>
--- input below ---
<concepts>{concepts}</concepts>
Suggested Model: gpt-oss-20b, qwen-3.5-27b
Cost: $$$
Step 2.2: Deduplicate generated problems
At ~9M generated problems, duplicates will be sure shot there. The same concept combination at different temperatures might produce essentially the same problem with minor wording changes. Sending duplicates to Stage 3 wastes our most expensive resource: the solution model. Remember we don’t have infinite compute!
Duplication can be done at two levels:
Exact dedup: Identical function names + identical docstrings → keep only one.
Fuzzy dedup: Problems that differ only in variable names, wording or formatting but essentially tests the same thing.
Deduping at a level where you want to run on ~9M problems is not a very simple problem. Please refer to this excellent resource from Nvidia on how to do fuzzy/exact dedup using Nemo Curator.
In practice, fuzzy dedup should removes 5-15% of problems. Again here writing from my experience, the number can be much higher or lower. This step is cheap (minutes on a single machine with MinHash) and saves significant cost in Stage 3. {Don’t skip this}
Step 2.3: Benchmark decontamination
At ~9M generated problems some will inevitably resemble HumanEval, MBPP because either the LLM memorized them or because common concept combinations naturally produce similar problems. If these leak into pretraining, downstream benchmark numbers are meaningless: the model isn’t solving problems, it’s recalling training data.
A check against a contamination set (all 164 HumanEval problems + 974 MBPP problems + any other benchmarks you plan to evaluate on) is needed.
This check can be done at two levels: exact dedup and fuzzy dedup as discussed previously. This is a cheap step as the contamination set is small, so comparing ~9M generated problems against it is minutes of work.
Step 2.4: Concept verification and difficulty labeling
Not every generated problem actually tests the concepts it was generated from. A problem tagged with 4 concepts might only need 2. Its always a good choice to know your data upfront rather than crying later on.
Generate as much metadata as you can to understand your data.
Prompt 2.3 — Concept verification and difficulty scoring
You are an expert Python programming analyst.
Given a Python problem and the concepts it was designed to test, evaluate the problem on two dimensions.
Task 1 — Concept verification:
For each listed concept, determine whether a correct solution to this problem would genuinely require that concept. A concept is "required" if removing that skill from the solver's knowledge would make the problem unsolvable or significantly harder.
Task 2 — Difficulty assessment:
Rate the problem on two scales:
- Category: "easy", "medium", or "hard"
- Score: 0-10 where:
0-2 = trivial (one-liner, basic syntax)
3-4 = easy (straightforward application of one technique)
5-6 = medium (requires combining techniques or careful thinking)
7-8 = hard (requires insight, non-obvious approach, or
multiple algorithmic steps)
9-10 = very hard (competition-level, requires advanced knowledge)
Return ONLY valid JSON:
{
"verified_concepts": ["concept.actually.needed", ...],
"dropped_concepts": {
"concept.not.needed": "reason it's not required"
},
"all_concepts_tested": true/false,
"difficulty": {
"category": "easy|medium|hard",
"score": 0-10,
"reasoning": "brief explanation"
}
}
Problem:
{problem}
Concepts it should test:
{concepts}
Suggested Model: Qwen-3.5-35b-a3b, gpt-oss-120b
Cost: $$$
This gives us three outputs per problem:
1. Verified concept tags. The problem keeps only the concepts it actually tests. This metadata can be useful later on to do filtering.
2. Concept coverage flag. all_concepts_tested: true/false tells us whether the problem is doing its job. We don’t necessarily reject problems where not all concepts are tested, but you must know your data.
3. Difficulty labels. The category and 0-10 score enables two things downstream:
Balanced dataset construction. We can ensure the final dataset has a healthy mix of easy, medium and hard problems.
Curriculum-aware pretraining. If you are planning to do phase wise pretraining this is really important. Some pretraining recipes might want to show harder problems in later phases. Difficulty labels make this possible without re-classification later.
If the distribution is too skewed lets say 80% easy, we can selectively regenerate more problems for concept combinations that consistently produce medium/hard problems or increase k=3 and k=4 problems per combo. So all this metadata is really helpful for you to focus what you are going to work on next.
Is this step really necessary?
If you are short on compute you can choose to skip this step, but as a good practice you should run this atleast on 10% of your data to understand the distribution of data. This was not present in NVIDIA’s paper but I am sure they might be generating metadata too in similar ways.
Stage 3: Multi Solution Generation
Each of the ~9.1M problems from Stage 2 now needs solutions (There could be less depending on dedup). NVIDIA generated 5 solutions per problem, I guess thats reasonable. But here we need to use a stronger model than the one used for problem generation: writing correct, clean code requires stronger reasoning than writing a function signature.
NVIDIA used gpt-oss-120b for this stage (vs 20B for problems) with a flat 60 line limit and generated ~23M problem solution pairs before cleaning. I am not sure how did they arrive at the 23M number, because they had around 14M problems so each generating 5 solutions would be 70M, maybe they just stopped suddenly after generating 23M.
Line limits by complexity
A k=1 problem testing just math.gcd shouldn’t need 60 lines. A k=4 problem combining graph.paths.weighted + dp.approach.memoization + algorithms.backtracking.constraint-solving + graph.modeling.adjacency-dict might genuinely need 80. Using one limit for both either bloats simple solutions or cramps complex ones.
We scale the limit by k (Again something fancy but I guess must be done):
This line limit is enforced programmatically after generation. Just using a simple script like count non empty, non commented lines and then discard solutions that exceed the limit.
Style diversity across the 4 solutions
The 4 solutions per problem are not just 4 attempts at the same thing with different random seeds. Each targets a different coding style:
This means the pretraining data teaches the model that the same problem can be solved with a for loop or itertools with recursion or iteration in 8 lines or in 30 given the LLM is smart enough and we are lucky that it is able to generate this diversity.
Style diversity per problem is strictly more valuable per token than 4 slight variations of the same approach.
Prompt 3.1: Solution generation
The prompt is structured in a different way (because i somehow love xml more than json) so all static instructions come first and all per problem variables are at the end in XML tags. This maximizes KV cache reuse across ~36M calls the model only recomputes attention for the variable tail. Yay!!
You are an expert Python programmer. You will be given a Python problem skeleton (function signature or class stub with a docstring) and a list of programming concepts it tests. Your job is to write the implementation.
Rules:
1. Keep all names, parameters, type hints, and docstrings exactly as provided. Write only the implementation body (function body, method bodies, or decorator logic).
2. Keep your solution under the specified line limit (non empty, non comment lines). Write clean, readable Python and do not sacrifice clarity for brevity.
3. Do not add any imports beyond what is already provided in the problem.
4. Your solution must correctly handle all examples in the docstring including edge cases.
5. You may define helper functions or inner classes before the main definition if needed.
6. Your solution must demonstrate the listed concepts. For example if the concepts include dp.pattern.knapsack, use dynamic programming and not brute-force enumeration even if brute force would also work.
Wrap your output in XML tags. Put the complete code (imports from the problem + your implementation) inside the tags. No explanation, no markdown fences.
<solution>
...your complete code here...
</solution>
--- inputs below ---
<concepts>{concepts}</concepts>
<line_limit>{line_limit}</line_limit>
<style>{style_instruction}</style>
<problem>
{problem}
</problem>
Suggested Model: gpt-oss-120b, qwen3.5-122B-A10B
Cost: $
What this produces
~9.1M problems × 4 solutions = ~36.4M problem-solution pairs.
Not all 4 solutions for a given problem will survive cleaning. Some will add unauthorized imports, some will fail AST validation, some will modify the original problem. Thats fine even if we only get 50% we will still get diverse enough dataset.
Stage 4: Solution Cleaning & Validation
This is where ~36.4M raw pairs become a clean final dataset. Every problem solution pair passes through a pipeline of filters: each catching a different class of failure [this is such a headache]. A pair must survive all filters to make it into the final dataset.
NVIDIA applied three filters (import check, solution extraction, AST validation) and saw a ~35% rejection rate. These three are a good starting point but I would suggest four more which are more complicated than AST validation as it alone can’t detect major issues.
Filter 4.1: Import validation
The solution must not add any imports beyond what the original problem specifies.
# Problem says:
from typing import List
import heapq
# Solution adds:
import numpy as np # ← REJECT: unauthorized import
Why this matters: without this check, the solution model cheats by importing libraries that do the heavy lifting. A solution that imports ‘networkx’ for a graph problem hasn’t demonstrated the algorithmic thinking the concepts were supposed to test.
Extract imports from both the problem and solution using ast.parse(), compare the sets and reject any solution where solution_imports - problem_imports is non empty. Simple solution no LLM calls!
Filter 4.2: Signature and docstring validation
The solution model outputs complete code inside tags. Extract that code (simple regex) and then verify it preserved the original problem skeleton: same function/class names, same parameter names, same docstring.
import ast, re
def extract_solution(raw_output: str) -> str | None:
match = re.search(r'<solution>(.*?)</solution>', raw_output, re.DOTALL)
return match.group(1).strip() if match else None
def signatures_match(problem_code: str, solution_code: str) -> bool:
"""Check that the solution kept names, params, and docstring intact."""
prob_tree = ast.parse(problem_code)
sol_tree = ast.parse(solution_code)
for prob_node, sol_node in zip(
(n for n in ast.walk(prob_tree) if isinstance(n, (ast.FunctionDef, ast.ClassDef))),
(n for n in ast.walk(sol_tree) if isinstance(n, (ast.FunctionDef, ast.ClassDef)))
):
if prob_node.name != sol_node.name:
return False
if isinstance(prob_node, ast.FunctionDef):
prob_params = [a.arg for a in prob_node.args.args]
sol_params = [a.arg for a in sol_node.args.args]
if prob_params != sol_params:
return False
return True
If extract_solution returns None → model didn’t use XML tags, reject. If signatures_match returns False → model renamed something, reject.
No stitching, no subtle bugs. Either accept the solution exactly as the model wrote it or throw it away. We don’t need to keep things where LLMs didn’t follow our instructions.
NVIDIA’s pipeline needed stitching because their prompt didn’t anchor names strongly. Ours explicitly instructs the model to keep all names and parameters intact and uses XML output tags for clean extraction, so compliance is high enough that rejecting non compliant outputs is cheaper than trying to surgically repair them.
Filter 4.3: AST validation
The extracted solution code must parse as valid Python.
import ast
try:
ast.parse(solution_code)
except SyntaxError:
# REJECT: not valid Python
This catches unclosed brackets, malformed strings, indentation errors, incomplete function bodies and any other syntax issues. Its fast, deterministic and requires no execution environment.
Filter 4.4: Line limit enforcement
The solution model was instructed to stay under the line limit (40/60/80/100 by k), but LLMs count poorly. Enforce it programmatically:
def count_code_lines(solution_body: str) -> int:
"""Count non-empty, non-comment lines."""
return sum(
1 for line in solution_body.strip().split('\n')
if line.strip() and not line.strip().startswith('#')
)
Solutions exceeding the limit for their k level are discarded. This ensures the pretraining data consistently models concise solutions for simple problems and allows more room only where complexity demands it.
Filter 4.5: Trivial and broken solution detection
AST validation confirms the code is syntactically valid — but syntactically valid code can still be useless:
# Passes AST check but is obviously wrong:
def count_primes(n: int) -> int:
"""Count primes less than n."""
pass
def count_primes(n: int) -> int:
"""Count primes less than n."""
return None
def count_primes(n: int) -> int:
"""Count primes less than n."""
return 0
def count_primes(n: int) -> int:
"""Count primes less than n."""
...
Reject solutions where the entire function body (or all method bodies in a class) is one of: pass, ..., return None, return 0, return “”, return [], return {} or return False. Also reject solutions shorter than 2 non trivial lines, a real solution to a real problem is never a single return statement.
Remember the order of filters is very important. Like run this filter 5 only after executing the previous filters.
Filter 4.6: Doctest execution (optional)
This is the most powerful filter and the one NVIDIA didn’t use (or they did but never told). If the problem includes >>> doctest examples in the docstring, we can actually run them against the solution using pythons built in doctest module inside a sandboxed environment with a timeout. Read the line again, inside a sandboxed environment. Sandboxed environment is a must.
Why this is powerful: AST validation confirms the code is valid Python. Doctest execution confirms it actually works. A solution that parses correctly but produces wrong outputs gets caught here.
Why it’s optional: it requires an execution environment (sandboxed, with timeouts and memory limits). Some solutions will involve randomness, floating point imprecision, or system dependent behavior that causes false failures. And at ~36M pairs, execution takes real compute time.
of course, if you have compute constraints you can run doctest execution on a sample (10-20% of pairs) to measure the quality delta. If it catches a significant additional percentage beyond the other filters, make it a standard filter. If most of what it catches is floating point edge cases, keep it optional.
The reference implementation for the sandboxed doctest runner is in Appendix D.
Filter 4.7: Duplicate detection
At ~36M pairs, some concept combinations will produce near identical problems and some solutions will be near identical across problems. Excessive duplication wastes pretraining tokens.
Deduplicate at two levels like we did previously:
Exact dedup: same problem text + same solution body → keep only one. This is a simple hash comparison after stripping whitespace.
Fuzzy dedup: solutions that differ only in variable names, whitespace, or comment phrasing → keep only one. The same approach discussed in Stage 2 for problem level dedup. (Will write a complete writeup on this one if there is more interest)
Closing
Thanks for making this far, this was a brief writeup on how to create synthetic data for a specific task like teaching python concepts to a model during pretraining.
If someone in an interview asks me this questions, this would be my response. Honestly no one is gonna ask such a trivial question.
But why stop here?
Think of all the programming languages and use the techniques mentioned here as a foundation to create synthetic data for them. You can replicate the exact same data engineering pipelines for different languages.
The pipeline is the product here. The specific numbers 141 concepts, 12 affinity groups, 4 solutions are tunable. The architecture i.e. taxonomy → concept combinations → constrained generation → deterministic validation is what matters. Swap the concepts, change the k multipliers and add your own filters. The framework stays the same.
References
Appendix:
Appendix are available on github to read. Since the post was getting long. I linked directly to github readme.
Thanks for reading SparseDense! Subscribe for free to receive new posts and support my work.


