Main Results
Empirical Evaluation of Existing LLMs
We benchmark state-of-the-art LLMs on NeuComBack-L2 (x86_64, 151 cases) to establish baselines for Neural Compilation. Reasoning-enhanced models show clear advantages: DeepSeek-R1 achieves the strongest baseline with 45.70% ACC and 21.85% ACC+Perf, while O3-Mini/O1 also perform notably better than non-reasoning-specialized models.
| Model | ACC (%) | ACC+Perf (%) |
|---|---|---|
| GPT-4o | 1.99 (3/151) | 0.66 (1/151) |
| O3-Mini | 21.19 (32/151) | 5.30 (8/151) |
| O1 | 19.87 (30/151) | 5.30 (8/151) |
| DeepSeek-V3 | 14.57 (22/151) | 3.31 (5/151) |
| DeepSeek-R1 | 45.70 (69/151) | 21.85 (33/151) |
ACC: functional correctness; ACC+Perf: correctness with runtime better than clang -O3.
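The two metrics can be computed directly from per-case outcomes. The sketch below is illustrative only: the record fields (`correct`, `runtime`, `o3_runtime`) and the toy data are assumptions, not the paper's actual evaluation harness.

```python
# Hypothetical per-case records: whether the generated assembly passed the
# functional tests, its measured runtime, and the runtime of the clang -O3
# reference binary. Field names and values are illustrative.
cases = [
    {"correct": True,  "runtime": 0.80, "o3_runtime": 1.00},  # correct, faster than -O3
    {"correct": True,  "runtime": 1.20, "o3_runtime": 1.00},  # correct, but slower
    {"correct": False, "runtime": None, "o3_runtime": 1.00},  # functionally wrong
]

def acc(cases):
    """Fraction of cases whose generated assembly is functionally correct."""
    return sum(c["correct"] for c in cases) / len(cases)

def acc_perf(cases):
    """Fraction of cases that are correct AND outrun the clang -O3 binary."""
    return sum(c["correct"] and c["runtime"] < c["o3_runtime"] for c in cases) / len(cases)

print(f"ACC = {acc(cases):.2%}, ACC+Perf = {acc_perf(cases):.2%}")
# → ACC = 66.67%, ACC+Perf = 33.33%
```

ACC+Perf is by construction a subset of ACC, which is why it is never larger in the tables above.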
Effect of Self-Evolving Prompt Optimization
Using DeepSeek-R1 on NeuComBack-L1 (x86_64), our learned prompt greatly boosts correctness: 50.00% → 80.00% ACC on the test set (40 samples), a relative improvement of ~60%.
| Method | ACC (%) |
|---|---|
| Baseline | 50.00 (20/40) |
| Ours | 80.00 (32/40) |
On NeuComBack-L2 (x86_64, test 25), our learned prompt improves initial correctness
from 44.00% → 64.00% and raises the share of high-performance solutions from
24.00% → 40.00%. After two rounds of iterative optimization the advantage widens:
ACC+Perf rises from 28.00% → 56.00%, doubling the baseline. Notably, of the 16
correct programs produced by our method, 14 (87.5%) surpass -O3.
| Method & Stage | ACC (%) | ACC+Perf (%) |
|---|---|---|
| Baseline Prompt — After Initial Generation | 44.00 (11/25) | 24.00 (6/25) |
| Baseline Prompt — After 2 Rounds Iter. Opt. | -- | 28.00 (7/25) |
| Ours (Learned Prompt) — After Initial Generation | 64.00 (16/25) | 40.00 (10/25) |
| Ours (Learned Prompt) — After 2 Rounds Iter. Opt. | -- | 56.00 (14/25) |
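The derived figures quoted above follow directly from the table's counts; a quick check (all numbers are taken from the table, nothing new is introduced):

```python
# Counts from the table above, out of 25 test cases.
baseline_after_opt = 7 / 25   # baseline ACC+Perf after 2 rounds: 28.00%
ours_after_opt = 14 / 25      # our ACC+Perf after 2 rounds: 56.00%

print(ours_after_opt / baseline_after_opt)  # → 2.0 (our method doubles ACC+Perf)
print(f"{14 / 16:.1%}")  # → 87.5% (share of our 16 correct programs beating -O3)
```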
Generalization to aarch64
On aarch64 (NeuComBack-L2, test 25), our method substantially improves both correctness and performance: ACC 36.00% → 72.00% and ACC+Perf 8.00% → 28.00%, showing strong applicability across different instruction sets.
| Method | ACC (%) | ACC+Perf (%) |
|---|---|---|
| Baseline | 36.00 (9/25) | 8.00 (2/25) |
| Ours | 72.00 (18/25) | 28.00 (7/25) |
Transferability across Data Distributions
Prompts learned on NeuComBack-L2 transfer positively to NeuComBack-L1 (x86_64): test ACC improves from 50.00% → 67.50% without additional learning, and overall ACC reaches 74.50% across all 200 L1 cases.
| Prompt Strategy | Test Set ACC (%) | Overall ACC (%) |
|---|---|---|
| Default Prompt | 50.00 (20/40) | 54.50 (109/200) |
| Learned on NeuComBack-L1 | 80.00 (32/40) | -- |
| Learned on NeuComBack-L2 | 67.50 (27/40) | 74.50 (149/200) |
Efficiency: Fewer Self-Debug Rounds
For correctly solved programs, our method consistently reduces the average number of self-debug rounds across architectures and datasets, indicating faster convergence to correct and performant assembly.
| Architecture | Dataset | Max Debug Rounds | Baseline | Our Method |
|---|---|---|---|---|
| x86_64 | NeuComBack-L1 | 1 | 0.90 | 0.28 |
| x86_64 | NeuComBack-L2 | 2 | 1.09 | 0.25 |
| aarch64 | NeuComBack-L2 | 4 | 1.44 | 1.22 |
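The round counts above come from a generate-test-repair loop. The following is a minimal sketch of such a self-debug loop, assuming hypothetical `generate_asm` and `run_tests` callables; it is not the paper's actual pipeline, only an illustration of how "rounds used" is bounded by the per-dataset budget.

```python
def self_debug(prompt, max_rounds, generate_asm, run_tests):
    """Generate assembly, then re-prompt with the failure message until the
    tests pass or the round budget is exhausted.

    Returns (asm, rounds_used) on success, or (None, max_rounds) on failure.
    Round 0 means the initial generation was already correct.
    """
    asm = generate_asm(prompt)
    for rounds_used in range(max_rounds + 1):
        ok, error = run_tests(asm)
        if ok:
            return asm, rounds_used
        if rounds_used == max_rounds:
            break  # budget exhausted; give up on this case
        # Feed the failure back to the model for one more debug round.
        asm = generate_asm(prompt + f"\nPrevious attempt failed: {error}")
    return None, max_rounds

# Toy usage with stub callables: the first attempt fails, the second passes,
# so the case is solved using exactly one self-debug round.
attempts = []
def stub_generate(p):
    attempts.append(p)
    return f"asm{len(attempts)}"
def stub_run(asm):
    return (asm == "asm2", "wrong return value")

asm, rounds = self_debug("compile foo to x86_64", max_rounds=2,
                         generate_asm=stub_generate, run_tests=stub_run)
print(rounds)  # → 1
```

The per-table "Baseline" and "Our Method" columns are then averages of `rounds_used` over the correctly solved programs only.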