Natural Language to SQL (NL2SQL) enables intuitive interaction with databases by transforming natural language queries into structured SQL statements. Recent advances in NL2SQL have substantially improved human-computer interaction in database query applications and support a wide range of data science analysis tasks. Current NL2SQL models mainly focus on optimizing workflows and their components, such as schema linking, content retrieval, and generation correction.
Despite these advances, significant challenges remain, particularly in inference performance for complex scenarios involving multi-table joins and nested queries. Current methods rely primarily on supervised fine-tuning (SFT) to train NL2SQL models, which can limit adaptability and interpretability in new environments (e.g., finance and healthcare).
To enhance the reasoning performance of NL2SQL models in complex situations, we introduce SQL-R1, a novel NL2SQL reasoning model trained with reinforcement learning (RL) algorithms. We design a specialized RL reward function tailored to NL2SQL tasks and discuss the impact of a cold start on the effectiveness of RL training. In addition, we achieve competitive accuracy using only a small amount of synthetic NL2SQL data for augmented training and further explore data engineering for RL.
Key Results: SQL-R1 achieves execution accuracy of 88.6% on the Spider benchmark and 67.1% on the BIRD benchmark, demonstrating state-of-the-art performance in natural language to SQL translation tasks.
Figure 1: Overview of SQL-R1 training framework using reinforcement learning for NL2SQL tasks.
This section introduces two ways of training NL2SQL models with RL algorithms: direct reinforcement training, and reinforcement training after an SFT-based cold start. The cold start refers to first training the base model with SFT on specific data so that it acquires a preliminary ability to reason and follow instructions. In addition, because real-world data is limited, we use recent synthetic data to support the training process.
We utilize the SynSQL-2.5M dataset as our primary data source, which is the first million-scale synthetic NL2SQL dataset, encompassing over 2.5 million diverse and high-quality data samples. The dataset features more than 16,000 synthetic databases across various domains, ensuring extensive coverage of realistic scenarios.
SFT Dataset: We investigate the impact of the cold start condition (i.e., SFT before RL) on subsequent RL training. We use a dataset of 200,000 samples drawn from SynSQL-2.5M for SFT (SynSQL-200K), with a uniform number of samples across difficulty levels. For each sample $v = (x, t, y^*)$, $x$ represents the NL question, $t$ represents the reasoning process enclosed in <think>...</think> tags, and $y^*$ denotes the SQL enclosed in <answer>...</answer> tags.
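To make the data format concrete, here is a minimal sketch of how one SynSQL-200K sample could be serialized into a prompt/response pair for SFT; the prompt template and field names are our own illustrative assumptions, not the exact format used in the paper.

```python
# Hypothetical serialization of one SFT sample (x, t, y*) into prompt/response form.
# The prompt template and field names below are illustrative assumptions.
def build_sft_example(question: str, schema: str, reasoning: str, sql: str) -> dict:
    prompt = (
        "Given the database schema below, answer the question with a single SQL query.\n"
        f"Schema:\n{schema}\n"
        f"Question: {question}\n"
    )
    # Target response: reasoning t inside <think> tags, final SQL y* inside <answer> tags.
    response = (
        f"<think>{reasoning}</think>\n"
        f"<answer>\n```sql\n{sql}\n```\n</answer>"
    )
    return {"prompt": prompt, "response": response}
```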
RL Dataset: We randomly sample 5K NL-SQL pairs of Complex difficulty from SynSQL-2.5M (SynSQL-Complex-5K). For each pair $v = (x, y^*)$ in the RL dataset, $x$ represents the NL question while $y^*$ denotes the reference SQL. The aim of reinforcement learning is to enhance the accuracy of the answers and ensure they adhere to the expected format.
In the reinforcement learning phase, we employ the Group Relative Policy Optimization (GRPO) algorithm, which obviates the need for a value model and has lower memory requirements. For each natural language question paired with its database schema, the policy model generates a set of $G$ SQL candidates $\{o_1, o_2, \ldots, o_G\}$ from the old policy $\pi_{\theta_{\text{old}}}$, which are evaluated using a composite reward function.
$$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}_{\mathbf{v} \sim P(\mathbf{V}), \{o_i\}_{i=1}^G \sim \pi_{\theta_{\text{old}}}(O|\mathbf{v})} \left[ \frac{1}{G} \sum_{i=1}^G \left( \min \left( r_i^{\text{ratio}} A_i, \text{clip} \left( r_i^{\text{ratio}}, 1-\epsilon, 1+\epsilon \right) A_i \right) - \beta D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}}) \right) \right]$$
where $r_i^{\text{ratio}} = \frac{\pi_\theta(o_i \mid \mathbf{v})}{\pi_{\theta_{\text{old}}}(o_i \mid \mathbf{v})}$ is the importance sampling ratio; $A_i$ is the group-relative advantage of each output; the clipping operator with hyperparameter $\epsilon$ and the KL coefficient $\beta$ control the update step size and the divergence regularization; and $\pi_{\text{ref}}$ is the reference policy.
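As a concrete illustration of the group-relative advantage $A_i$, the sketch below applies the standard GRPO normalization (reward minus group mean, divided by group standard deviation) to the composite rewards of $G$ candidates; it is a minimal assumption-level sketch, not the training code released with SQL-R1.

```python
import numpy as np

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> np.ndarray:
    """Normalize each candidate's composite reward against its group of G samples."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: composite rewards for G = 4 SQL candidates of one question.
print(group_relative_advantages([6.5, -1.0, 5.0, -5.0]))
```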
We utilize a progressive feedback mechanism consisting of four types of rewards: Format Reward, Execution Reward, Result Reward, and Length Reward. This layered approach enhances the model's learning by providing detailed feedback at various stages.
Format Reward ($S_f$): We encourage the model to enclose the reasoning process within <think>...</think> tags and the final answer within <answer>...</answer> tags. SQL statements must be contained within ```sql...``` code blocks.
$$S_f = \begin{cases} 1, & \text{if format is correct} \\ -1, & \text{if format is incorrect} \end{cases}$$
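A sketch of how such a format check might look in practice; the exact tag-matching rules are not spelled out here, so the regular expressions below are assumptions.

```python
import re

def format_reward(response: str) -> float:
    """Return +1 if the response follows the <think>/<answer>/```sql layout, else -1."""
    has_think = re.search(r"<think>.*?</think>", response, re.DOTALL) is not None
    answer = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    has_sql = answer is not None and re.search(r"```sql.*?```", answer.group(1), re.DOTALL)
    return 1.0 if has_think and has_sql else -1.0
```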
Execution Reward ($S_e$): Evaluates the syntactic correctness of SQL candidates, preventing the model from generating unexecutable responses.
$$S_e = \begin{cases} 2, & \text{if SQL candidate is executable} \\ 0, & \text{if format is incorrect} \\ -2, & \text{if SQL candidate is not executable} \end{cases}$$
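Executability can be checked by simply running the candidate against the target database. The sketch below assumes SQLite database files (as used by the Spider and BIRD benchmarks); it is an illustration, not the exact implementation.

```python
import sqlite3

def execution_reward(sql: str, db_path: str, format_ok: bool) -> float:
    """+2 if the candidate executes, 0 if the format is wrong, -2 otherwise."""
    if not format_ok:
        return 0.0
    try:
        with sqlite3.connect(db_path) as conn:
            conn.execute(sql).fetchall()
        return 2.0
    except sqlite3.Error:
        return -2.0
```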
Result Reward ($S_r$): Evaluates the accuracy of query results using Execution Accuracy (EX).
$$S_r = \begin{cases} 3, & \text{if query result is correct} \\ 0, & \text{if format is incorrect or not executable} \\ -3, & \text{if query result is incorrect} \end{cases}$$
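One common way to compute Execution Accuracy is to compare the result sets of the predicted and reference queries; the order-insensitive multiset comparison below is an assumption about how EX is evaluated, kept deliberately simple.

```python
import sqlite3
from collections import Counter

def result_reward(pred_sql: str, gold_sql: str, db_path: str, executable: bool) -> float:
    """+3 if predicted rows match the gold rows, 0 if not executable or bad format, -3 otherwise."""
    if not executable:
        return 0.0
    with sqlite3.connect(db_path) as conn:
        pred_rows = conn.execute(pred_sql).fetchall()
        gold_rows = conn.execute(gold_sql).fetchall()
    return 3.0 if Counter(pred_rows) == Counter(gold_rows) else -3.0
```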
Length Reward ($S_l$): Incentivizes the model to produce comprehensive reasoning processes while avoiding superfluous explanations.
$$S_l = \begin{cases} 0.5 \times S_{tl} + S_{al}, & \text{if correct and } len_{response} \leq \text{MAX\_LENGTH} \\ 0.5 + S_{al}, & \text{if correct and } len_{response} > \text{MAX\_LENGTH} \\ 0, & \text{other cases} \end{cases}$$
where $S_{tl} = (len_{think} + len_{answer}) / \text{MAX\_LENGTH}$ and $S_{al} = len_{sql} / len_{answer}$.
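Putting the four signals together, here is a sketch of the length reward and a simple composite score. Token counting via whitespace splitting and the MAX_LENGTH value are illustrative assumptions; MAX_LENGTH is a training hyperparameter.

```python
MAX_LENGTH = 2048  # illustrative budget, not the paper's exact setting

def length_reward(think: str, answer: str, sql: str, correct: bool) -> float:
    """Reward thorough but bounded reasoning, following the S_l definition above."""
    if not correct:
        return 0.0
    len_think, len_answer = len(think.split()), len(answer.split())
    s_al = len(sql.split()) / max(len_answer, 1)   # share of the answer that is actual SQL
    len_response = len_think + len_answer
    if len_response <= MAX_LENGTH:
        return 0.5 * (len_response / MAX_LENGTH) + s_al
    return 0.5 + s_al

def total_reward(s_f: float, s_e: float, s_r: float, s_l: float) -> float:
    # Assumed to be a simple sum of the four reward components.
    return s_f + s_e + s_r + s_l
```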
To select the most appropriate SQL during inference, the model generates several SQL candidates, each with its own reasoning process, for a given question. We execute all candidates and, via self-consistency voting, select the highest-scoring one as the final answer. Notably, SQL-R1's response includes an observable thinking and interpretation process, making the results easier for users to understand.
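A minimal sketch of this selection step: score each generated candidate with the composite reward and keep the highest-scoring SQL. The simple argmax below stands in for the self-consistency voting procedure and may differ from the exact rule used in SQL-R1.

```python
def select_final_sql(candidates: list[dict]) -> str:
    """candidates: [{'sql': str, 'score': float}, ...] scored by the composite reward."""
    return max(candidates, key=lambda c: c["score"])["sql"]

# Example with three hypothetical candidates.
final_sql = select_final_sql([
    {"sql": "SELECT name FROM users WHERE age > 30;", "score": 6.2},
    {"sql": "SELECT * FROM users;", "score": 1.5},
    {"sql": "SELECT name FROM user WHERE age > 30;", "score": -4.0},
])
```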
We conduct comprehensive experiments on two widely-used NL2SQL benchmarks: Spider and BIRD. These datasets cover a wide range of database schemas and query complexities, providing a robust evaluation of our model's capabilities.
Our SQL-R1 model demonstrates state-of-the-art performance on both benchmark datasets, with significant improvements over baseline supervised fine-tuning approaches (see Table 1).
These results validate the effectiveness of our reinforcement learning approach in enhancing NL2SQL reasoning capabilities, particularly in handling complex queries involving multi-table joins, nested subqueries, and advanced SQL operations.
Table 1: Performance comparison on Spider and BIRD benchmarks. SQL-R1 shows consistent improvements across different model sizes.
We release SQL-R1 in three different sizes to accommodate various deployment scenarios.
All variants demonstrate consistent improvements over their respective supervised baselines, confirming the scalability of our reinforcement learning approach across different model capacities.
Our paper provides comprehensive experimental analysis beyond the results presented above. For detailed information on the following aspects, we refer readers to the full paper:
- Detailed analysis of individual component contributions, including reward function design, training strategies, and data engineering techniques
- In-depth examination of failure cases and challenging query patterns across different complexity levels
- Extended comparisons with state-of-the-art methods on additional benchmarks and query types
- Discussion of practical deployment considerations and use cases in real-world database systems
📄 For complete experimental protocols, additional results, and comprehensive discussions, please read the full paper: Ma et al., "SQL-R1: Training Natural Language to SQL Reasoning Model By Reinforcement Learning", arXiv:2504.08600, 2025.
If you find SQL-R1 useful for your research, please consider citing our paper:
@article{ma2025sql,
title={SQL-R1: Training Natural Language to SQL Reasoning Model By Reinforcement Learning},
author={Ma, Peixian and Zhuang, Xialie and Xu, Chengjin and Jiang, Xuhui and Chen, Ran and Guo, Jian},
journal={arXiv preprint arXiv:2504.08600},
year={2025}
}