Learning with Challenges: Adaptive Difficulty-Aware Data Generation for Mobile GUI Agent Training

Linjia Kang 1‡, Zhimin Wang1‡, Yongkang Zhang1‡, Duo Wu1, Jinghe Wang1, Ming Ma2, Haopeng Yan2, Zhi Wang1*
1Tsinghua University
2Kuaishou Technology

‡Equal Contribution. *Corresponding Author. Yongkang's work was done during his internship at Tsinghua University.

Key Innovation

MobileGen mimics how humans gradually master complex tasks by dynamically generating training data that closely aligns with the capability frontier of GUI agents, thereby driving the continual evolution of their capabilities.

Abstract

Large-scale, high-quality interaction trajectories are essential for advancing mobile Graphical User Interface (GUI) agents. While existing methods typically rely on labor-intensive human demonstrations or automated model exploration to generate GUI trajectories, they lack fine-grained control over task difficulty. This fundamentally restricts learning effectiveness due to the mismatch between the training difficulty and the agent's capabilities. Inspired by how humans acquire skills through progressively challenging tasks, we propose MobileGen, a novel data generation framework that adaptively aligns training difficulty with the GUI agent's capability frontier. Specifically, MobileGen explicitly decouples task difficulty into structural (e.g., trajectory length) and semantic (e.g., task goal) dimensions. It then iteratively evaluates the agent on a curated prior dataset to construct a systematic profile of its capability frontier across these two dimensions. With this profile, the probability distribution of task difficulty is adaptively computed, from which the target difficulty for the next round of training can be sampled. Guided by the sampled difficulty, a multi-agent controllable generator is finally used to synthesize high-quality interaction trajectories along with corresponding task instructions. Extensive experiments show that MobileGen consistently outperforms existing data generation methods by improving the average performance of GUI agents by a factor of 1.57 across multiple challenging benchmarks. This highlights the importance of capability-aligned data generation for effective mobile GUI agent training.

Methodology

Overview of our proposed pipeline, MobileGen.

Key Stages of Pipeline

MobileGen continuously pushes the frontier of the agent's capability while maximizing training effectiveness by dynamically setting a challenge point that aligns data difficulty with the agent's current capability; a minimal sketch of this adaptive loop follows the list below.

  • Agent Capability Profiling: The student agent (the agent to be tuned) is evaluated on a prior dataset to derive structural and semantic capability profiles.
  • Difficulty Distribution Generation: Based on the profiles, the challenge point is set to form the desired difficulty distribution.
  • Difficulty-Aware Trajectory Generation: Guided by the sampled difficulty parameters, the explorer and supervisor collaborate to generate interaction trajectories; reasoning traces and instructions are then reconstructed via inverse synthesis.
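
The following is a minimal Python sketch of the profiling-and-sampling stages, assuming a Gaussian difficulty distribution centered slightly beyond the profiled frontier. The helper names (agent.evaluate, mcg.generate) and the result fields are hypothetical placeholders, not the released implementation.

```python
import numpy as np

def profile_capability(agent, prior_dataset):
    """Estimate the structural and semantic frontier: the largest trajectory
    length and task-goal complexity the agent still solves on the prior set.
    agent.evaluate and the result fields are hypothetical placeholders."""
    results = [agent.evaluate(task) for task in prior_dataset]
    solved = [r for r in results if r.success]
    structural = max((r.trajectory_length for r in solved), default=1)
    semantic = max((r.goal_complexity for r in solved), default=1)
    return structural, semantic

def sample_target_difficulty(frontier, challenge_offset=1.0, spread=0.5, rng=None):
    """Sample the next round's target difficulty from a distribution centered
    slightly beyond the frontier (the challenge point); the Gaussian form and
    the offset value are illustrative assumptions."""
    rng = rng or np.random.default_rng()
    structural, semantic = frontier
    target_length = max(1, int(round(rng.normal(structural + challenge_offset, spread))))
    target_goal = max(1, int(round(rng.normal(semantic + challenge_offset, spread))))
    return target_length, target_goal

# Wiring of one training round (agent and generator objects are placeholders):
# frontier   = profile_capability(agent, prior_dataset)
# difficulty = sample_target_difficulty(frontier)
# trajs      = mcg.generate(difficulty)   # Multi-Agent Controllable Generator
# agent.finetune(trajs)
```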
Detailed workflow of MCG.

Multi-Agent Controllable Generator (MCG)

MCG generates difficulty-aware training trajectories controlled by sampled difficulty parameters through multi-agent collaboration.

  • During Interaction Trajectory Generation, the supervisor allocates an exploration step budget to each application before exploration begins, dynamically manages the explorer's step usage across applications, and performs rollbacks to correct interaction errors (a minimal sketch of this collaboration follows the list).
  • Following exploration, the synthesizer reconstructs step-level thoughts and trajectory-level instructions for all primitive trajectories.
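
Below is a minimal sketch of the collaboration, assuming hypothetical supervisor, explorer, and synthesizer objects and method names (allocate_budget, act, check, rollback, reconstruct); the actual MCG prompts and interfaces may differ.

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    app: str
    steps: list = field(default_factory=list)   # executed GUI actions with screenshots

def run_mcg(supervisor, explorer, synthesizer, apps, total_budget):
    """The supervisor splits the step budget across apps, monitors the explorer's
    step usage, and rolls back faulty interactions; the synthesizer then rebuilds
    thoughts and instructions via inverse synthesis."""
    budgets = supervisor.allocate_budget(apps, total_budget)   # per-app step budgets
    trajectories = []
    for app in apps:
        traj = Trajectory(app=app)
        used = 0
        while used < budgets[app]:
            step = explorer.act(app)          # propose and execute one GUI action
            used += 1                         # every attempt consumes budget
            if not supervisor.check(step):    # interaction error detected
                explorer.rollback()           # undo the faulty action and retry
                continue
            traj.steps.append(step)
        trajectories.append(traj)
    # Inverse synthesis: step-level thoughts plus a trajectory-level instruction.
    return [synthesizer.reconstruct(t) for t in trajectories]
```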

Main Results

MobileGen significantly outperforms existing data generation methods, improving the average performance of GUI agents by a factor of 1.57 across multiple challenging benchmarks.

Model                  AndroidWorld   AndroidControl-Curated-Hard        AndroidControl-Curated-Easy
                       SR (%)         Type (%)  Grounding (%)  SR (%)    Type (%)  Grounding (%)  SR (%)
GPT-5                  46.6           66.2      14.5           16.6      79.0      16.1           29.6
GUI-Owl-7B             54.3           62.2      64.5           41.6      71.0      83.9           61.8
UI-TARS-1.5-7B         39.7           62.6      37.6           28.2      70.6      64.9           51.6
Zero-shot
  Qwen3-VL-4B-Inst.    25.0           70.8      26.4           25.6      81.4      49.1           37.8
  Qwen3-VL-8B-Inst.    31.0           71.2      27.9           24.0      78.6      51.9           34.2
  InternVL3-14B        29.3           69.4      32.7           27.6      78.0      47.5           33.6
OS-Genesis
  Qwen3-VL-4B-Inst.    31.9           72.8      29.7           27.8      83.6      68.4           54.4
  Qwen3-VL-8B-Inst.    37.9           75.2      31.2           29.6      85.4      69.6           56.8
  InternVL3-14B        35.3           71.6      37.9           31.2      81.8      69.9           55.0
GUI-ReWalk
  Qwen3-VL-4B-Inst.    33.6           71.8      28.8           29.2      82.8      67.1           53.2
  Qwen3-VL-8B-Inst.    41.4           73.0      33.3           31.6      84.2      68.7           55.8
  InternVL3-14B        31.9           73.8      36.4           32.8      80.6      66.5           53.6
Ours (MobileGen)
  Qwen3-VL-4B-Inst.    40.5           77.6      37.9           35.6      86.8      70.9           58.6
  Qwen3-VL-8B-Inst.    45.7           80.8      45.5           38.4      88.6      72.2           61.2
  InternVL3-14B        42.2           78.6      46.7           36.6      86.8      68.4           58.8

Results on AndroidWorld and AndroidControl-Curated. "SR" denotes task success rate, "Type" indicates action-type accuracy, and "Grounding" represents grounding accuracy computed on annotated subsets. Results in bold and underline denote the best and second-best results of each trained backbone, respectively.
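
For reference, the step-level metrics on AndroidControl-style data can be approximated as below. This is an illustrative sketch: it assumes each action is a dict with a 'type' field and normalized 'x'/'y' coordinates for click-like actions, treats a step as successful only when both type and grounding match, and uses an assumed click tolerance; the benchmarks' exact matching rules may differ, and AndroidWorld's SR is an episode-level success rate rather than this step-level one.

```python
def step_metrics(preds, refs, click_tol=0.14):
    """Step-wise Type / Grounding / SR over aligned predicted and reference
    actions. click_tol is an assumed tolerance on normalized coordinates."""
    type_hits = ground_hits = ground_total = sr_hits = 0
    for p, r in zip(preds, refs):
        type_ok = p["type"] == r["type"]
        ground_ok = True
        if "x" in r:                      # step belongs to the annotated grounding subset
            ground_total += 1
            ground_ok = (abs(p.get("x", 2.0) - r["x"]) <= click_tol and
                         abs(p.get("y", 2.0) - r["y"]) <= click_tol)
            ground_hits += ground_ok
        sr_hits += type_ok and ground_ok  # full step match
        type_hits += type_ok
    n = max(len(refs), 1)
    return {"Type": type_hits / n,
            "Grounding": ground_hits / max(ground_total, 1),
            "SR": sr_hits / n}

# Example: two steps, one carrying a grounding annotation.
# preds = [{"type": "click", "x": 0.52, "y": 0.31}, {"type": "scroll"}]
# refs  = [{"type": "click", "x": 0.50, "y": 0.30}, {"type": "type_text"}]
# step_metrics(preds, refs)  ->  {"Type": 0.5, "Grounding": 1.0, "SR": 0.5}
```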


Model                  Tool          Information   Shopping      Media         Social        Multi-Apps    Overall
                       HL    LL      HL    LL      HL    LL      HL    LL      HL    LL      HL    LL      HL    LL
GPT-5                  38.6  75.5    27.7  65.4    23.9  61.0    26.9  71.4    36.6  74.8    24.4  65.5    29.2  68.4
GUI-Owl-7B             71.3  73.0    58.1  62.0    52.6  68.4    59.7  67.9    66.5  71.0    54.1  64.0    59.7  66.9
UI-TARS-1.5-7B         52.7  66.0    39.8  54.7    31.6  49.1    47.5  61.6    48.8  64.8    39.2  54.1    42.7  57.7
Zero-shot
  Qwen3-VL-4B-Inst.    44.5  67.8    37.1  57.8    33.3  55.7    36.3  59.3    42.9  67.1    33.4  57.6    37.5  60.5
  Qwen3-VL-8B-Inst.    44.7  70.8    35.3  59.6    31.2  58.5    36.0  63.0    41.4  69.1    34.4  60.5    37.0  63.2
  InternVL3-14B        23.3  44.4    18.6  41.4    18.1  38.2    15.1  34.6    22.5  43.6    18.3  37.8    19.4  40.1
OS-Genesis
  Qwen3-VL-4B-Inst.    55.1  76.4    46.4  64.0    47.3  66.2    47.8  68.3    58.2  76.1    50.2  69.9    49.0  70.0
  Qwen3-VL-8B-Inst.    60.6  78.8    51.0  67.0    49.4  69.1    52.1  74.1    61.1  79.6    56.0  74.4    53.6  73.3
  InternVL3-14B        62.8  75.1    52.8  66.1    46.7  68.5    48.5  70.3    59.4  78.0    53.8  72.8    52.1  71.0
GUI-ReWalk
  Qwen3-VL-4B-Inst.    57.2  78.1    48.2  66.2    47.0  71.9    51.4  74.4    56.2  79.3    52.2  72.2    51.8  72.1
  Qwen3-VL-8B-Inst.    63.9  80.5    53.9  69.8    51.1  71.0    56.9  78.6    63.0  82.5    58.0  76.9    56.8  75.9
  InternVL3-14B        59.9  77.6    50.5  65.5    48.2  67.2    53.6  72.3    60.5  83.4    55.2  74.0    53.4  72.4
Ours (MobileGen)
  Qwen3-VL-4B-Inst.    58.9  77.7    49.5  66.1    49.1  70.8    52.1  76.3    60.2  81.1    54.2  74.2    54.8  75.1
  Qwen3-VL-8B-Inst.    63.5  79.9    53.5  69.1    53.1  72.7    58.0  76.9    64.2  80.3    60.0  78.9    59.8  81.3
  InternVL3-14B        61.1  78.9    52.1  67.8    50.2  70.4    52.5  74.3    62.1  82.9    57.2  76.0    56.4  75.4

Results on GUI-Odyssey. "HL" and "LL" denote high-level and low-level task instructions, respectively. The metric reported is the action match score (AMS). Results in bold and underline denote the best and second-best results of each trained backbone, respectively.

Deep Dive

Ablation Study

These results reveal that effective GUI agent learning critically depends on jointly modeling the difficulty distribution and controlling trajectory quality.

Impact of Challenge Point

This observation suggests that GUI agent training follows a frontier-sensitive learning dynamic, where progress is maximized when the difficulty of generated task trajectories stays near the agent’s evolving capability boundary.

Citation

Please cite our work if you find it useful for your research: