Nazmus Ashrafi

How Generation Architecture Shapes Code Complexity in Multi-Agent LLM Systems

LLM code generation has moved from single-shot prompting to multi-agent orchestrations: pipelines of analyst, coder, tester, and debugger agents. These systems are almost always judged on functional correctness (pass@1). But the code they produce is also read, reviewed, debugged, and maintained by humans, and its structural complexity carries a downstream cost that pass@1 never captures.

This preprint asks a question the field has largely left open: does the choice of generation architecture change the structural complexity of the code, and if so, which orchestration layers carry the cost?

Setup

Six architectures built from three composable layers: role decomposition (R, an Analyst), testing with bounded iteration (T, a Tester), and runtime debugging (D, an execution-grounded Debugger). The configurations are Basic, AC, ACT, Debugger, AC+Debugger, and ACT+Debugger.
Two models from the GPT-4o family (gpt-4o-2024-08-06 and gpt-4o-mini-2024-07-18).
164 HumanEval tasks, every task solved under all six architectures and both models, giving 164 × 6 × 2 = 1,968 paired observations.
Five RADON complexity metrics: source lines of code (SLOC), cyclomatic complexity, and Halstead Volume, Difficulty, and Effort.
A paired, non-parametric statistical pipeline (Friedman omnibus, Wilcoxon signed-rank post-hoc with Holm correction, Kendall's W and matched-pairs rank-biserial effect sizes).
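
The statistical pipeline in the last item can be sketched in plain Python. This is a minimal illustration, not the preprint's code: all function names are hypothetical, the Friedman statistic here omits the tie correction that library implementations typically apply, and the p-values passed to the Holm step are assumed to come from a Wilcoxon signed-rank routine in a statistics library.

```python
def average_ranks(values):
    """Rank values ascending (1 = smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def friedman_and_kendalls_w(table):
    """table[task][architecture] holds one metric value per cell.

    Returns (chi2, W): the Friedman statistic without tie correction and
    Kendall's W derived from it as chi2 / (n * (k - 1)).
    """
    n, k = len(table), len(table[0])
    rank_sums = [0.0] * k
    for row in table:  # rank the architectures within each task
        for col, rank in enumerate(average_ranks(row)):
            rank_sums[col] += rank
    chi2 = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    return chi2, chi2 / (n * (k - 1))


def rank_biserial(x, y):
    """Matched-pairs rank-biserial correlation; zero differences are dropped."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    if not diffs:
        return 0.0
    ranks = average_ranks([abs(d) for d in diffs])
    pos = sum(r for r, d in zip(ranks, diffs) if d > 0)
    neg = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return (pos - neg) / (pos + neg)


def holm(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running = 0.0
    for position, i in enumerate(order):
        # multiply the smallest p by m, the next by m - 1, ...; keep it monotone
        running = max(running, min(1.0, (m - position) * p_values[i]))
        adjusted[i] = running
    return adjusted
```

With six architectures there are 15 pairwise comparisons per metric, so under these assumptions `holm` would receive 15 p-values at a time. A value of W near 1 means the tasks agree almost perfectly on how the architectures rank.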

Key findings

The six architectures collapse into two complexity clusters. There is a lean cluster (Basic, Debugger, AC+Debugger) and a heavy cluster (AC, ACT, ACT+Debugger), separated by a 50–130% complexity gap. The same split holds in both models and under a passing-only robustness check.
The layer effects are non-additive. The analyst–coder split (R) inflates complexity. The runtime debugger (D) deflates it: added on top of the analyst–coder split, it pulls the configuration back into the lean cluster. The tester (T) then re-inflates it.
The extra complexity buys no accuracy. The heavy cluster generates 50–60% more code yet does not exceed the lean cluster on pass@1 — in some cells it falls below Basic. The two architectures tied for the best pass@1 (Debugger and AC+Debugger) are both in the lean cluster.
Architecture is a broader lever than prompt phrasing. Where prior prompt-pattern work moved only line-count measures, generation architecture shifts all five complexity metrics — reaching the control-flow and vocabulary structure of the code, not just its length.
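
One way to make the non-additivity claim concrete is to compare what a layer does on two different backgrounds. The sketch below is an illustration under assumptions, not the preprint's analysis: the mapping from configuration names to layer sets is a reading of the setup description, the helper names are hypothetical, and `metric` stands for any per-configuration summary (for example a median SLOC) that the caller supplies.

```python
# Assumed mapping of the six configurations onto the layers R, T, and D.
CONFIGS = {
    "Basic": frozenset(),
    "AC": frozenset("R"),
    "ACT": frozenset("RT"),
    "Debugger": frozenset("D"),
    "AC+Debugger": frozenset("RD"),
    "ACT+Debugger": frozenset("RTD"),
}


def layer_effect(metric, layer, background):
    """Change in `metric` when `layer` is added on top of the `background` layers.

    `metric` maps configuration name -> summary value. Raises KeyError when
    either configuration is not among the six that were evaluated.
    """
    by_layers = {layers: name for name, layers in CONFIGS.items()}
    base = by_layers[frozenset(background)]
    added = by_layers[frozenset(background) | {layer}]
    return metric[added] - metric[base]


def interaction(metric, layer, background_a, background_b):
    """Difference between a layer's effect on two backgrounds.

    If layers combined additively, this would be zero for every metric.
    """
    return layer_effect(metric, layer, background_a) - layer_effect(
        metric, layer, background_b
    )
```

Calling `interaction(metric, "D", "R", "")` compares the debugger's effect on the analyst–coder background with its effect on Basic. Only six of the eight possible layer combinations exist in this design, so some contrasts, such as the tester's effect on Basic, cannot be formed and raise `KeyError`.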

Why it matters for practitioners

Correctness is an incomplete scoreboard. Report cheap, automatically computed complexity metrics alongside pass@1 when comparing architectures.
Elaborate pipelines were dominated. The heavy multi-agent chains were costlier, slower, and structurally heavier than the lean configurations, with no correctness gain on this benchmark.
Prefer execution-grounded feedback. Feedback grounded in executing the code (a debugger) keeps output lean and can simplify it; adding more conversational planning/critique roles inflates it. Architectural elaboration should be justified by a measured benefit, not assumed.
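
Part of what makes these metrics cheap is that the three Halstead measures reduce to four token counts. The sketch below applies the standard Halstead definitions to counts that are already available. It is a hedged illustration only: the function and type names are hypothetical, and it does not reproduce how RADON tokenizes Python source into operators and operands.

```python
import math
from typing import NamedTuple


class Halstead(NamedTuple):
    volume: float
    difficulty: float
    effort: float


def halstead(n1, n2, N1, N2):
    """Standard Halstead measures from token counts.

    n1, n2: number of distinct operators and distinct operands.
    N1, N2: total occurrences of operators and of operands.
    """
    vocabulary = n1 + n2
    length = N1 + N2
    # Volume: program length scaled by the bits needed per vocabulary item.
    volume = length * math.log2(vocabulary) if vocabulary > 0 else 0.0
    # Difficulty: grows with operator variety and with operand reuse.
    difficulty = (n1 / 2) * (N2 / n2) if n2 > 0 else 0.0
    # Effort: difficulty times volume.
    return Halstead(volume, difficulty, difficulty * volume)
```

Because effort is the product of the other two measures, it amplifies differences between architectures, which is worth keeping in mind when it is reported next to pass@1.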

This is a preprint and has not been published in a venue; it serves as an open archival record.

arXiv: 2606.00308
Code: github.com/nazmus-ashrafi/Multi-Agent-Code-Complexity