Code Dataset Generator

Transform your source code into fine-tuning datasets for Falcon 40B and other LLMs

Upload Source Files

Drag & drop your source files here

or

Supported: Python, C/C++, Rust, Go, JavaScript, Java, PHP, Ruby, TypeScript

Selected Files

{{ file.name }} {{ formatFileSize(file.size) }}

Processing Options

Dataset Preview

Processing files... {{ processedFiles }} / {{ selectedFiles.length }}

Generated Samples

{{ previewData.length }} items
{{ previewData[selectedSampleIndex].instruction }}
{{ previewData[selectedSampleIndex].input }}
{{ previewData[selectedSampleIndex].output }}

Your processed dataset will appear here

Key Features

Multi-language Support

Automatically detects and processes code in Python, C/C++, Rust, Go, JavaScript and more.
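As a rough illustration of how detection might work, the sketch below maps a file extension to a language name. The page does not say how the tool detects languages, so the `EXTENSION_TO_LANGUAGE` table and the `detect_language` helper are assumptions made for this example.

```python
from pathlib import Path
from typing import Optional

# Hypothetical extension table covering the languages listed as supported.
EXTENSION_TO_LANGUAGE = {
    ".py": "Python",
    ".c": "C",
    ".h": "C",
    ".cpp": "C++",
    ".cc": "C++",
    ".hpp": "C++",
    ".rs": "Rust",
    ".go": "Go",
    ".js": "JavaScript",
    ".java": "Java",
    ".php": "PHP",
    ".rb": "Ruby",
    ".ts": "TypeScript",
}


def detect_language(filename: str) -> Optional[str]:
    """Return the language for a filename, or None if the extension is unknown."""
    suffix = Path(filename).suffix.lower()  # case-insensitive match
    return EXTENSION_TO_LANGUAGE.get(suffix)
```

Files with an unknown or missing extension return `None`, so a caller could skip them instead of guessing.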

Smart Prompt Engineering

Automatically generates meaningful instruction-output pairs from your source code.
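One way to picture pair generation, for Python sources only, is to turn each documented function into a sample with the `instruction`, `input`, and `output` fields shown in the preview. This is a minimal sketch using the standard `ast` module. The `extract_pairs` helper and the instruction wording are hypothetical, not the tool's actual method.

```python
import ast
from typing import Dict, List


def extract_pairs(source: str) -> List[Dict[str, str]]:
    """Build one sample per documented function in a Python source string."""
    samples: List[Dict[str, str]] = []
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        doc = ast.get_docstring(node)
        if not doc:
            continue  # no docstring, so nothing to use as an instruction
        code = ast.get_source_segment(source, node)  # requires Python 3.8+
        if code is None:
            continue
        samples.append({
            # Hypothetical instruction wording, built from the docstring.
            "instruction": f"Write a Python function named `{node.name}`. {doc}",
            "input": "",
            "output": code,
        })
    return samples
```

Functions without a docstring are skipped in this sketch, since there is no natural-language description to pair with the code.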

Falcon 40B Optimized

Tokenization and formatting tuned for fine-tuning Falcon 40B.
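The formatting step could look roughly like the sketch below, which flattens a sample into a single training string and writes one JSON object per line (JSONL). The prompt template and the `text` field name are assumptions chosen for illustration, not an official Falcon format. Tokenization is not covered here.

```python
import json
from typing import Dict, Iterable

# Hypothetical prompt template; adjust to whatever the training script expects.
PROMPT_TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)


def to_training_text(sample: Dict[str, str]) -> str:
    """Render one instruction/input/output sample as a single string."""
    return PROMPT_TEMPLATE.format(
        instruction=sample["instruction"],
        input=sample["input"],
        output=sample["output"],
    )


def to_jsonl(samples: Iterable[Dict[str, str]]) -> str:
    """Serialize samples as JSONL, one {"text": ...} object per line."""
    return "\n".join(
        json.dumps({"text": to_training_text(s)}, ensure_ascii=False)
        for s in samples
    )
```

Because `json.dumps` escapes newlines inside the code, each sample stays on one physical line of the output file.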
