How Generation Architecture Shapes Code Complexity in Multi-Agent LLM Systems
LLM code generation has moved from single-shot prompting to multi-agent orchestrations: analyst, coder, tester, and debugger pipelines. These systems are almost always judged on functional correctness (pass@1). But the code they produce is also read, reviewed, debugged, and maintained by humans, and its structural complexity carries a downstream cost that pass@1 never captures.
This preprint asks a question the field has largely left open: does the choice of generation architecture change the structural complexity of the code, and if so, which orchestration layers carry the cost?
Setup
- Six architectures built from three composable layers: role decomposition (R, an Analyst), testing with bounded iteration (T, a Tester), and runtime debugging (D, an execution-grounded Debugger). The configurations are Basic, AC, ACT, Debugger, AC+Debugger, and ACT+Debugger.
- Two models from the GPT-4o family (gpt-4o-2024-08-06 and gpt-4o-mini-2024-07-18).
- 164 HumanEval tasks, every task solved under all six architectures and both models, yielding 1,968 paired observations.
- Five RADON complexity metrics: source lines of code (SLOC), cyclomatic complexity, and Halstead Volume, Difficulty, and Effort.
- A paired, non-parametric statistical pipeline (Friedman omnibus, Wilcoxon signed-rank post-hoc with Holm correction, Kendall's W and matched-pairs rank-biserial effect sizes).
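The effect-size and correction steps of such a paired pipeline can be sketched with the standard library alone. This is a minimal illustration, not the preprint's code: the function names are hypothetical, Kendall's W is computed without a tie correction, and the Friedman and Wilcoxon p-values themselves are assumed to come from elsewhere (only their Holm adjustment is shown).

```python
from typing import Dict, List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share one rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kendalls_w(rows: Sequence[Sequence[float]]) -> float:
    """Kendall's W for n tasks (rows) by k architectures (columns).

    Assumption: no tie correction is applied.
    """
    n, k = len(rows), len(rows[0])
    rank_sums = [0.0] * k
    for row in rows:
        for col, r in enumerate(average_ranks(row)):
            rank_sums[col] += r
    mean_sum = n * (k + 1) / 2
    s = sum((rs - mean_sum) ** 2 for rs in rank_sums)
    return 12 * s / (n ** 2 * (k ** 3 - k))


def rank_biserial(a: Sequence[float], b: Sequence[float]) -> float:
    """Matched-pairs rank-biserial: (R+ - R-) / (R+ + R-), zero differences dropped."""
    diffs = [x - y for x, y in zip(a, b) if x != y]
    if not diffs:
        return 0.0
    ranks = average_ranks([abs(d) for d in diffs])
    pos = sum(r for r, d in zip(ranks, diffs) if d > 0)
    neg = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return (pos - neg) / (pos + neg)


def holm_adjust(pvalues: Dict[str, float]) -> Dict[str, float]:
    """Holm step-down adjusted p-values, kept monotone and capped at 1."""
    ordered = sorted(pvalues.items(), key=lambda kv: kv[1])
    m = len(ordered)
    adjusted: Dict[str, float] = {}
    running_max = 0.0
    for i, (name, p) in enumerate(ordered):
        running_max = max(running_max, min(1.0, (m - i) * p))
        adjusted[name] = running_max
    return adjusted
```

Each row passed to `kendalls_w` would hold one task's metric value (for example SLOC) under every architecture. If all tasks rank the architectures identically, W comes out as 1. `rank_biserial` compares two architectures on the same tasks and is positive when the first tends to be larger.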
Key findings
- The six architectures collapse into two complexity clusters. A lean cluster (Basic, Debugger, AC+Debugger) and a heavy cluster (AC, ACT, ACT+Debugger) are separated by a 50–130% complexity gap. The same split holds in both models and under a passing-only robustness check.
- The layer effects are non-additive. The analyst–coder split (R) inflates complexity; the runtime debugger (D) deflates it, and on the analyst–coder background it actively pulls the configuration back into the lean cluster; and the tester (T) re-inflates it.
- The extra complexity buys no accuracy. The heavy cluster generates 50–60% more code yet does not exceed the lean cluster on pass@1; in some cells it falls below Basic. The two architectures tied for the best pass@1 (Debugger and AC+Debugger) are both in the lean cluster.
- Architecture is a broader lever than prompt phrasing. Where prior prompt-pattern work moved only line-count measures, generation architecture shifts all five complexity metrics, reaching the control-flow and vocabulary structure of the code, not just its length.
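One way to see the non-additivity is to write the configurations down as sets of layers and check the reported clusters against them. The sketch below is illustrative only. The layer assignment for each configuration is an assumption inferred from the names (AC as R, ACT as R plus T, and so on), and the helper names are hypothetical. The cluster membership is taken from the findings above.

```python
from typing import Dict, FrozenSet

# Assumed mapping from configuration name to active layers (R, T, D).
LAYERS: Dict[str, FrozenSet[str]] = {
    "Basic": frozenset(),
    "AC": frozenset({"R"}),
    "ACT": frozenset({"R", "T"}),
    "Debugger": frozenset({"D"}),
    "AC+Debugger": frozenset({"R", "D"}),
    "ACT+Debugger": frozenset({"R", "T", "D"}),
}

# Cluster membership as reported in the key findings.
LEAN = {"Basic", "Debugger", "AC+Debugger"}
HEAVY = {"AC", "ACT", "ACT+Debugger"}


def cluster(config: str) -> str:
    """Return the reported complexity cluster for a configuration."""
    return "lean" if config in LEAN else "heavy"


def add_layer(config: str, layer: str) -> str:
    """Name of the configuration obtained by switching one more layer on."""
    target = LAYERS[config] | {layer}
    for name, layers in LAYERS.items():
        if layers == target:
            return name
    raise KeyError(f"no configuration with layers {sorted(target)}")


def separable_by_layer_count() -> bool:
    """True if a threshold on the number of layers would split the clusters."""
    lean_counts = {len(LAYERS[c]) for c in LEAN}
    heavy_counts = {len(LAYERS[c]) for c in HEAVY}
    return max(lean_counts) < min(heavy_counts)


# Walk the path Basic -> +R -> +D -> +T and record the cluster at each step.
path = ["Basic"]
for layer in ("R", "D", "T"):
    path.append(add_layer(path[-1], layer))
clusters = [cluster(c) for c in path]
```

Under these assumptions the path alternates lean, heavy, lean, heavy, and `separable_by_layer_count()` is false. The cluster depends on which layers are active, not on how many.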
Why it matters for practitioners
- Correctness is an incomplete scoreboard. Report cheap, automatically computed complexity metrics alongside pass@1 when comparing architectures.
- Elaborate pipelines were dominated. The heavy multi-agent chains were costlier, slower, and structurally heavier with no correctness return on this benchmark.
- Prefer execution-grounded feedback. Feedback grounded in executing the code (a debugger) keeps output lean and can simplify it; adding more conversational planning/critique roles inflates it. Architectural elaboration should be justified by a measured benefit, not assumed.
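Reporting a complexity figure next to pass@1 can be as cheap as the sketch below. It is a stand-in, not the study's tooling. The preprint uses RADON, whereas this version relies only on Python's standard `ast` module, so its numbers will not match RADON's exactly. The function names, the choice of decision nodes, and the row layout are all assumptions made for illustration.

```python
import ast
import statistics
from typing import Dict, List, Union

# Node types counted as decision points in this approximation (an assumption).
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                 ast.IfExp, ast.comprehension, ast.Assert)


def approx_complexity(source: str) -> Dict[str, int]:
    """Rough SLOC and cyclomatic-style count for one Python snippet.

    SLOC here means non-blank lines that are not pure comments, so
    docstring lines are still counted.
    """
    tree = ast.parse(source)
    sloc = sum(
        1 for line in source.splitlines()
        if line.strip() and not line.strip().startswith("#")
    )
    decisions = 0
    for node in ast.walk(tree):
        if isinstance(node, _BRANCH_NODES):
            decisions += 1
        elif isinstance(node, ast.BoolOp):
            # each extra operand of and/or adds one branch
            decisions += len(node.values) - 1
    return {"sloc": sloc, "cyclomatic": decisions + 1}


def report_row(architecture: str, passed: int, total: int,
               sources: List[str]) -> Dict[str, Union[str, float]]:
    """One scoreboard row: pass@1 next to median complexity of the solutions."""
    metrics = [approx_complexity(s) for s in sources]
    return {
        "architecture": architecture,
        "pass@1": passed / total,
        "median_sloc": statistics.median(m["sloc"] for m in metrics),
        "median_cyclomatic": statistics.median(m["cyclomatic"] for m in metrics),
    }


# Hypothetical generated solution, used only to exercise the functions.
EXAMPLE = (
    "def sign(x):\n"
    "    # classify x\n"
    "    if x > 0 and x < 10:\n"
    "        return 1\n"
    "    for _ in range(2):\n"
    "        pass\n"
    "    return 0\n"
)
example_metrics = approx_complexity(EXAMPLE)
```

For `EXAMPLE`, the six non-comment lines give a SLOC of 6, and the `if`, the `and`, and the `for` give a cyclomatic-style count of 4. A table built from `report_row` for each architecture puts correctness and structural cost side by side.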
This is a preprint and is not published in a venue — it serves as an open archival record.