Main Results
Empirical Evaluation of Existing LLMs
We benchmark state-of-the-art LLMs on NeuComBack-L2 (x86_64, 151 cases) to establish baselines for Neural Compilation. Reasoning-enhanced models show clear advantages: DeepSeek-R1 achieves the strongest baseline with 45.70% ACC and 21.85% ACC+Perf, while O3-Mini/O1 also perform notably better than non-reasoning-specialized models.
| Model | ACC (%) | ACC+Perf (%) |
|---|---|---|
| GPT-4o | 1.99 (3/151) | 0.66 (1/151) |
| O3-Mini | 21.19 (32/151) | 5.30 (8/151) |
| O1 | 19.87 (30/151) | 5.30 (8/151) |
| DeepSeek-V3 | 14.57 (22/151) | 3.31 (5/151) |
| DeepSeek-R1 | 45.70 (69/151) | 21.85 (33/151) |
ACC: functional correctness; ACC+Perf: correctness with runtime better than clang -O3.
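The two metrics can be computed directly from per-case outcomes. The sketch below is illustrative only: the record fields (`correct`, `runtime`, `o3_runtime`) and the toy data are assumptions, not the paper's actual evaluation harness.

```python
# Hypothetical per-case records: whether the generated assembly passed the
# functional tests, its measured runtime, and the runtime of the clang -O3
# reference binary. Field names and values are illustrative.
cases = [
    {"correct": True,  "runtime": 0.80, "o3_runtime": 1.00},  # correct, faster than -O3
    {"correct": True,  "runtime": 1.20, "o3_runtime": 1.00},  # correct, but slower
    {"correct": False, "runtime": None, "o3_runtime": 1.00},  # functionally wrong
]

def acc(cases):
    """Fraction of cases whose generated assembly is functionally correct."""
    return sum(c["correct"] for c in cases) / len(cases)

def acc_perf(cases):
    """Fraction of cases that are correct AND outrun the clang -O3 binary."""
    return sum(c["correct"] and c["runtime"] < c["o3_runtime"] for c in cases) / len(cases)

print(f"ACC = {acc(cases):.2%}, ACC+Perf = {acc_perf(cases):.2%}")
# → ACC = 66.67%, ACC+Perf = 33.33%
```

ACC+Perf is by construction a subset of ACC, which is why it is never larger in the tables above.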
Effect of Self-Evolving Prompt Optimization
Using DeepSeek-R1 on NeuComBack-L1 (x86_64), our learned prompt greatly boosts correctness: 50.00% → 80.00% ACC on the test set (40 samples), a relative improvement of ~60%.
| Method | ACC (%) |
|---|---|
| Baseline | 50.00 (20/40) |
| Ours | 80.00 (32/40) |
On NeuComBack-L2 (x86_64, test 25), our learned prompt improves initial correctness
from 44.00% → 64.00% and raises the share of high-performance solutions from
24.00% → 40.00%. After two rounds of iterative optimization the advantage widens:
ACC+Perf rises from 28.00% → 56.00%, doubling the baseline. Notably, of the 16
correct programs produced by our method, 14 (87.5%) surpass -O3.
| Method & Stage | ACC (%) | ACC+Perf (%) |
|---|---|---|
| Baseline Prompt — After Initial Generation | 44.00 (11/25) | 24.00 (6/25) |
| Baseline Prompt — After 2 Rounds Iter. Opt. | -- | 28.00 (7/25) |
| Ours (Learned Prompt) — After Initial Generation | 64.00 (16/25) | 40.00 (10/25) |
| Ours (Learned Prompt) — After 2 Rounds Iter. Opt. | -- | 56.00 (14/25) |
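The derived figures quoted above follow directly from the table's counts; a quick check (all numbers are taken from the table, nothing new is introduced):

```python
# Counts from the table above, out of 25 test cases.
baseline_after_opt = 7 / 25   # baseline ACC+Perf after 2 rounds: 28.00%
ours_after_opt = 14 / 25      # our ACC+Perf after 2 rounds: 56.00%

print(ours_after_opt / baseline_after_opt)  # → 2.0 (our method doubles ACC+Perf)
print(f"{14 / 16:.1%}")  # → 87.5% (share of our 16 correct programs beating -O3)
```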
Generalization to aarch64
On aarch64 (NeuComBack-L2, test 25), our method substantially improves both correctness and performance: ACC 36.00% → 72.00% and ACC+Perf 8.00% → 28.00%, showing strong applicability across different instruction sets.
| Method | ACC (%) | ACC+Perf (%) |
|---|---|---|
| Baseline | 36.00 (9/25) | 8.00 (2/25) |
| Ours | 72.00 (18/25) | 28.00 (7/25) |
Transferability across Data Distributions
Prompts learned on NeuComBack-L2 transfer positively to NeuComBack-L1 (x86_64): test ACC improves from 50.00% → 67.50% without additional learning, and overall ACC reaches 74.50% across all 200 L1 cases.
| Prompt Strategy | Test Set ACC (%) | Overall ACC (%) |
|---|---|---|
| Default Prompt | 50.00 (20/40) | 54.50 (109/200) |
| Learned on NeuComBack-L1 | 80.00 (32/40) | -- |
| Learned on NeuComBack-L2 | 67.50 (27/40) | 74.50 (149/200) |
Efficiency: Fewer Self-Debug Rounds
For correctly solved programs, our method consistently reduces the average number of self-debug rounds across architectures and datasets, indicating faster convergence to correct and performant assembly.
| Architecture | Dataset | Max Debug Rounds | Baseline | Our Method |
|---|---|---|---|---|
| x86_64 | NeuComBack-L1 | 1 | 0.90 | 0.28 |
| x86_64 | NeuComBack-L2 | 2 | 1.09 | 0.25 |
| aarch64 | NeuComBack-L2 | 4 | 1.44 | 1.22 |
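The round counts above come from a generate-test-repair loop. The following is a minimal sketch of such a self-debug loop, assuming hypothetical `generate_asm` and `run_tests` callables; it is not the paper's actual pipeline, only an illustration of how "rounds used" is bounded by the per-dataset budget.

```python
def self_debug(prompt, max_rounds, generate_asm, run_tests):
    """Generate assembly, then re-prompt with the failure message until the
    tests pass or the round budget is exhausted.

    Returns (asm, rounds_used) on success, or (None, max_rounds) on failure.
    Round 0 means the initial generation was already correct.
    """
    asm = generate_asm(prompt)
    for rounds_used in range(max_rounds + 1):
        ok, error = run_tests(asm)
        if ok:
            return asm, rounds_used
        if rounds_used == max_rounds:
            break  # budget exhausted; give up on this case
        # Feed the failure back to the model for one more debug round.
        asm = generate_asm(prompt + f"\nPrevious attempt failed: {error}")
    return None, max_rounds

# Toy usage with stub callables: the first attempt fails, the second passes,
# so the case is solved using exactly one self-debug round.
attempts = []
def stub_generate(p):
    attempts.append(p)
    return f"asm{len(attempts)}"
def stub_run(asm):
    return (asm == "asm2", "wrong return value")

asm, rounds = self_debug("compile foo to x86_64", max_rounds=2,
                         generate_asm=stub_generate, run_tests=stub_run)
print(rounds)  # → 1
```

The per-table "Baseline" and "Our Method" columns are then averages of `rounds_used` over the correctly solved programs only.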