    How far can we push AI autonomy in code generation?

    Software Development · big tee tech hub · August 6, 2025 · 16 mins read


    When people ask about the future of Generative AI in coding, what they
    often want to know is: Will there be a point where Large Language Models can
    autonomously generate and maintain a working software application? Will we
    be able to just author a natural language specification, hit “generate” and
    walk away, and AI will be able to do all the coding, testing and deployment
    for us?

    To learn more about where we are today, and what would have to be solved
    on a path from today to a future like that, we ran some experiments to see
    how far we could push the autonomy of Generative AI code generation with a
    simple application, today. The standard and quality lens applied to the
    results is the use case of developing digital products and business
    application software, the type of software that I’ve been building for
    most of my career. For example, I’ve worked a lot on large retail and
    listings websites: systems that typically provide RESTful APIs, store
    data in relational databases, and send events to each other. Risk
    assessments and definitions of what good code looks like will be
    different for other situations.

    The main goal was to learn about AI’s capabilities. A Spring Boot
    application like the one in our setup can probably be written in 1-2 hours
    by an experienced developer with a powerful IDE, and we don’t even bootstrap
    things that much in real life. However, it was an interesting test case to
    explore our main question: How might we push autonomy and repeatability of
    AI code generation?

    For the vast majority of our iterations, we used Claude-Sonnet models
    (either 3.7 or 4). In our experience, these consistently show the
    strongest coding capabilities of the available LLMs, so we found them
    the most suitable for this experiment.

    The strategies

    We employed a set of “strategies” one by one to see if and how they can
    improve the reliability of the generation and quality of the generated
    code. All of the strategies were used to improve the probability that the
    setup generates a working, tested and high quality codebase without human
    intervention. They were all attempts to introduce more control into the
    generation process.

    Choice of the tech stack

    We chose a simple “CRUD” API backend (Create, Read, Update, Delete)
    implemented in Spring Boot as the goal of the generation.


    Figure 1: Diagram of the intended
    target application, with typical Spring Boot layers of persistence,
    services, and controllers. Highlights how each layer should have tests,
    plus a set of E2E tests.

    As mentioned before, building an application like this is quite a
    simple use case. The idea was to start very simple, and then, if that
    worked, crank up the complexity or variety of the requirements.

    How can this increase the success rate?

    The choice of Spring Boot as the target stack was in itself our first
    strategy of increasing the chances of success.

    • A common tech stack that should be quite prevalent in the training
      data
    • A runtime framework that can do a lot of the heavy lifting, which means
      less code to generate for AI
    • An application topology that has very clearly established patterns:
      Controller -> Service -> Repository -> Entity, which makes it
      relatively easy to give AI a set of patterns to follow (see the
      sketch after this list)
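
    As a minimal sketch of what that topology looks like in code (all
    names here are illustrative, not from the actual generated codebase):

        // The layering we prompted for: Entity -> Repository -> Service -> Controller.
        import jakarta.persistence.Entity;
        import jakarta.persistence.GeneratedValue;
        import jakarta.persistence.GenerationType;
        import jakarta.persistence.Id;
        import org.springframework.data.jpa.repository.JpaRepository;
        import org.springframework.stereotype.Service;
        import org.springframework.web.bind.annotation.PostMapping;
        import org.springframework.web.bind.annotation.RequestBody;
        import org.springframework.web.bind.annotation.RequestMapping;
        import org.springframework.web.bind.annotation.RestController;

        @Entity
        class Product {
            @Id
            @GeneratedValue(strategy = GenerationType.IDENTITY)
            Long id;
            String name;
        }

        interface ProductRepository extends JpaRepository<Product, Long> {}

        @Service
        class ProductService {
            private final ProductRepository repository;
            ProductService(ProductRepository repository) { this.repository = repository; }
            Product create(Product product) { return repository.save(product); }
        }

        @RestController
        @RequestMapping("/products")
        class ProductController {
            private final ProductService service;
            ProductController(ProductService service) { this.service = service; }
            @PostMapping
            Product create(@RequestBody Product product) { return service.create(product); }
        }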

    Multiple agents

    We split the generation process into multiple agents. “Agent” here
    means that each of these steps is handled by a separate LLM session, with
    a specific role and instruction set. We did not make any other
    configurations per step for now, e.g. we did not use different models for
    different steps.


    Figure 2: Multiple agents in the generation
    process: Requirements analyst -> Bootstrapper -> Backend designer ->
    Persistence layer generator -> Service layer generator -> Controller layer
    generator -> E2E tester -> Code reviewer

    To avoid tainting the results with subpar coding abilities, we used a
    setup on top of an existing coding assistant that already has a bunch
    of coding-specific abilities: it can read and search a codebase, react
    to linting errors, retry when it fails, and so on. We needed one that
    can orchestrate subtasks with their own context window. The only one we
    were aware of at the time that could do that was Roo Code, and its fork
    Kilo Code. We used the latter. This gave us a facsimile of a
    multi-agent coding setup without having to build something from
    scratch.


    Figure 3: Subtasking setup in Kilo: An
    orchestrator session delegates to subtask sessions

    With a carefully curated allow-list of terminal commands, a human only
    needs to hit “approve” here and there. We let it run in the background and
    checked on it every now and then, and Kilo gave us a sound notification
    whenever it needed input or an approval.

    How can this increase the success rate?

    Even though the context window sizes of LLMs are technically
    increasing, LLM generation results still become more hit and miss the
    longer a session gets. Many coding assistants now offer the ability to
    compress the context intermittently, but common advice to developers
    using agents is still to restart coding sessions as frequently as
    possible.

    Secondly, it is a well-established prompting practice to assign roles
    and perspectives to LLMs to increase the quality of their results. We
    could take advantage of that as well with this separation into multiple
    agentic steps.
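
    To illustrate the idea (this is not how Kilo Code is implemented; we
    used it as-is), the core of a multi-agent pipeline is just running each
    step in a fresh session with its own role prompt. A minimal sketch,
    with a hypothetical LlmSession interface standing in for the coding
    assistant:

        // Sketch of the multi-agent idea: one fresh LLM session per step, each
        // with its own role instructions. LlmSession is a hypothetical stand-in
        // for the coding assistant's API, not a real library.
        import java.util.List;
        import java.util.function.Supplier;

        interface LlmSession {
            String run(String roleInstructions, String task);
        }

        record AgentStep(String name, String roleInstructions) {}

        class GenerationPipeline {
            static final List<AgentStep> STEPS = List.of(
                    new AgentStep("Requirements analyst", "Analyse the requirements and model the domain."),
                    new AgentStep("Backend designer", "Design the Spring Boot application structure."),
                    new AgentStep("Persistence layer generator", "Generate entities and repositories."),
                    new AgentStep("Service layer generator", "Generate the service layer."),
                    new AgentStep("Controller layer generator", "Generate the REST controllers."),
                    new AgentStep("E2E tester", "Write and run end-to-end tests."),
                    new AgentStep("Code reviewer", "Review the code against the original instructions."));

            static void run(Supplier<LlmSession> freshSession, String specification) {
                String handoff = specification;
                for (AgentStep step : STEPS) {
                    // A fresh session per step keeps each context window small.
                    handoff = freshSession.get().run(step.roleInstructions(), handoff);
                }
            }
        }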

    Stack-specific over general purpose

    As you can maybe already tell from the workflow and its separation
    into the typical controller, service and persistence layers, we didn’t
    shy away from using techniques and prompts specific to the Spring target
    stack.

    How can this increase the success rate?

    One of the key things people are excited about with Generative AI is
    that it can be a general purpose code generator that can turn natural
    language specifications into code in any stack. However, just telling
    an LLM to “write a Spring Boot application” is not going to yield the
    high quality and contextual code you need in a real-world digital
    product scenario without further instructions (more on that in the
    results section). So we wanted to see how stack-specific our setup would
    have to become to make the results high quality and repeatable.

    Use of deterministic scripts

    For bootstrapping the application, we used a shell script rather than
    having the LLM do this. After all, there is a CLI to create an
    up-to-date, idiomatically structured Spring Boot application, so why
    would we want AI to do this?

    The bootstrapping step was the only one where we used this technique,
    but it’s worth remembering that an agentic workflow like this by no
    means has to be entirely up to AI; we can mix and match with “proper
    software” wherever appropriate.
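
    Our bootstrapper used a shell script; as a sketch of the same idea
    expressed in code, the Spring Initializr HTTP API behind
    start.spring.io can generate the project skeleton deterministically
    (the artifact name and dependency list here are made up for the
    example):

        import java.net.URI;
        import java.net.http.HttpClient;
        import java.net.http.HttpRequest;
        import java.net.http.HttpResponse;
        import java.nio.file.Path;

        // Deterministic bootstrap: download an idiomatic Spring Boot skeleton
        // from the Spring Initializr HTTP API instead of asking the LLM to
        // invent the project structure.
        public class Bootstrap {
            public static void main(String[] args) throws Exception {
                String url = "https://start.spring.io/starter.zip"
                        + "?type=maven-project&language=java"
                        + "&dependencies=web,data-jpa,h2"
                        + "&artifactId=crud-api";
                HttpClient client = HttpClient.newHttpClient();
                client.send(HttpRequest.newBuilder(URI.create(url)).build(),
                        HttpResponse.BodyHandlers.ofFile(Path.of("crud-api.zip")));
            }
        }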

    Code examples in prompts

    Using example code snippets for the various patterns (Entity,
    Repository, …) turned out to be the most effective strategy to get AI
    to generate the type of code we wanted.

    How can this increase the success rate?

    Why do we need these code samples, and why do they matter for our
    digital product and business application software lens?

    The simplest example from our experiment is the use of libraries. For
    example, if not specifically prompted, we found that the LLM frequently
    uses javax.persistence, which has been superseded by
    jakarta.persistence. Extrapolate that example to a large engineering
    organization that has a specific set of coding patterns, libraries, and
    idioms that they want to use consistently across all their codebases.
    Sample code snippets are a very effective way to communicate these
    patterns to the LLM, and ensure that it uses them in the generated
    code.
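
    For example, an entity sample like the following in the prompt (a
    hypothetical minimal snippet) pins down both the library choice and the
    idioms we want; note the jakarta.persistence imports that steer the LLM
    away from the superseded javax.persistence packages:

        // Sample pattern handed to the LLM as part of the prompt.
        import jakarta.persistence.Entity;
        import jakarta.persistence.GeneratedValue;
        import jakarta.persistence.GenerationType;
        import jakarta.persistence.Id;

        @Entity
        public class Book {
            @Id
            @GeneratedValue(strategy = GenerationType.IDENTITY)
            private Long id;

            private String title;

            protected Book() {} // no-arg constructor required by JPA

            public Book(String title) { this.title = title; }

            public Long getId() { return id; }
            public String getTitle() { return title; }
        }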

    Also consider the use case of AI maintaining this application over time,
    and not just creating its first version. We would want it to be ready to use
    a new framework or new framework version as and when it becomes relevant, without
    having to wait for it to be dominant in the model’s training data. We would
    need a way for the AI tooling to reliably pick up on these library nuances.

    Reference application as an anchor

    It turned out that maintaining the code examples in the natural
    language prompts is quite tedious. When you iterate on them, you don’t
    get immediate feedback to see if your sample would actually compile, and
    you also have to make sure that all the separate samples you provide are
    consistent with each other.

    To improve the experience of the developer implementing the agentic
    workflow, we set up a reference application and an MCP (Model Context
    Protocol) server that can provide sample code to the agent from this
    reference application. This way, we could easily make sure that the
    samples compile and are consistent with each other.
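
    We won’t reproduce the MCP plumbing here, but as a simplified stand-in,
    the server’s job boils down to serving files out of the compiling
    reference application on request (path and port are illustrative):

        import com.sun.net.httpserver.HttpServer;
        import java.net.InetSocketAddress;
        import java.nio.file.Files;
        import java.nio.file.Path;

        // Simplified stand-in for the MCP server: serve code samples straight
        // from the reference application, so prompts never drift away from
        // code that actually compiles.
        public class ReferenceSampleServer {
            public static void main(String[] args) throws Exception {
                Path referenceApp = Path.of("reference-app/src/main/java");
                HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
                server.createContext("/sample", exchange -> {
                    // e.g. GET /sample?file=com/example/library/Book.java
                    String file = exchange.getRequestURI().getQuery().substring("file=".length());
                    byte[] body = Files.readAllBytes(referenceApp.resolve(file));
                    exchange.sendResponseHeaders(200, body.length);
                    try (var out = exchange.getResponseBody()) {
                        out.write(body);
                    }
                });
                server.start();
            }
        }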


    Figure 4: Reference application as an
    anchor

    Generate-review loops

    We introduced a review agent to double check AI’s work against the
    original prompts. This added an additional safety net to catch mistakes
    and ensure the generated code adhered to the requirements and
    instructions.

    How can this increase the success rate?

    An LLM’s first generation often doesn’t follow all of the instructions
    correctly, especially when there are a lot of them. However, when asked
    to review what it created, and how it matches the original
    instructions, it’s usually quite good at reasoning about the fidelity
    of its work, and can fix many of its own mistakes.
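
    A minimal sketch of such a generate-review loop, again with a
    hypothetical LlmSession interface standing in for the coding
    assistant:

        // Generate-review loop, sketched: a fresh reviewer session checks the
        // result against the original instructions, and findings are fed back
        // until the review comes back clean or we give up.
        import java.util.function.Supplier;

        interface LlmSession {
            String run(String roleInstructions, String task);
        }

        class GenerateReviewLoop {
            static String generate(Supplier<LlmSession> freshSession, String instructions, int maxRounds) {
                String code = freshSession.get().run("You are a Spring Boot developer.", instructions);
                for (int round = 0; round < maxRounds; round++) {
                    String findings = freshSession.get().run(
                            "You are a code reviewer. List deviations from the instructions, or reply OK.",
                            "Instructions:\n" + instructions + "\n\nCode:\n" + code);
                    if (findings.strip().equals("OK")) {
                        break;
                    }
                    code = freshSession.get().run("You are a Spring Boot developer.",
                            "Fix these review findings without adding anything new:\n" + findings
                                    + "\n\nCode:\n" + code);
                }
                return code;
            }
        }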

    Codebase modularization

    We asked the AI to divide the domain into aggregates, and use those
    to determine the package structure.


    Figure 5: Sample of modularised
    package structure

    This is actually an example of something that was hard to get AI to
    do without human oversight and correction. It is a concept that is also
    hard for humans to do well.

    Here is a prompt excerpt where we ask AI to
    group entities into aggregates during the requirements analysis
    step:

              An aggregate is a cluster of domain objects that can be treated as a
              single unit, it must stay internally consistent after each business
              operation.
    
              For each aggregate:
              - Name root and contained entities
              - Explain why this aggregate is sized the way it is
              (transaction size, concurrency, read/write patterns).

    We didn’t spend much effort on tuning these instructions and they can probably be improved,
    but in general, it’s not trivial to get AI to apply a concept like this well.
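
    To make the concept concrete, here is a hypothetical example of what a
    well-sized aggregate looks like in code, from an imagined library
    domain (not our actual generated code): the root and its contained
    entities live in one package, and consistency rules are enforced
    through the root.

        import jakarta.persistence.CascadeType;
        import jakarta.persistence.Entity;
        import jakarta.persistence.Id;
        import jakarta.persistence.OneToMany;
        import java.util.ArrayList;
        import java.util.List;

        // A "loan" aggregate: Loan (root) and LoanItem (contained) must stay
        // consistent together, so they share a package (e.g.
        // com.example.library.loan) and are only modified through the root.
        @Entity
        class Loan {
            @Id
            Long id;

            @OneToMany(cascade = CascadeType.ALL, orphanRemoval = true)
            List<LoanItem> items = new ArrayList<>();

            void addItem(LoanItem item) {
                // business rule enforced at the root keeps the aggregate consistent
                if (items.size() >= 5) {
                    throw new IllegalStateException("Loan limit reached");
                }
                items.add(item);
            }
        }

        @Entity
        class LoanItem {
            @Id
            Long id;
            String bookTitle;
        }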

    How can this increase the success rate?

    There are many benefits of code modularisation that improve runtime
    qualities, like query performance or transactionality concerns. But it
    also has many benefits for maintainability and extensibility, for both
    humans and AI:

    • Good modularisation limits the number of places where a change needs to be
      made, which means less context for the LLM to keep in mind during a change.
    • You can re-apply an agentic workflow like this one to one module at a time,
      limiting token usage, and reducing the size of a change set.
    • Being able to clearly limit an AI task’s context to specific code modules
      opens up possibilities to “freeze” all others, to reduce the chance of
      unintended changes. (We did not try this here though.)

    Results

    Round 1: 3-5 entities

    For most of our iterations, we used domains like “Simple product catalog”
    or “Book tracking in a library”, and edited down the domain design
    produced by the requirements analysis phase to a maximum of 3-5
    entities. The only logic in the requirements was a few validations;
    other than that, we just asked for straightforward CRUD APIs.

    We ran about 15 iterations of this category, with increasing sophistication
    of the prompts and setup. An iteration for the full workflow usually took
    about 25-30 minutes, and cost $2-3 of Anthropic tokens ($4-5 with
    “thinking” enabled).

    Ultimately, this setup could repeatedly generate a working application
    that followed most of our specifications and conventions with hardly
    any human intervention. It always ran into some errors, but could
    frequently fix them itself.

    Round 2: Pre-existing schema with 10 entities

    To crank up the size and complexity, we pointed the workflow at a
    pared-down existing schema for a Customer Relationship Management
    application (~10 entities), and also switched from in-memory H2 to
    Postgres. Like in round 1, there were a few validation and business
    rules, but no logic beyond that, and we asked it to generate CRUD API
    endpoints.

    The workflow ran for 4–5 hours, with quite a few human
    interventions in between.

    As a second step, we provided it with the full set of fields for the
    main entity and asked it to expand that entity from 15 to 50 fields.
    This ran for another hour.

    A game of whac-a-mole

    Overall, we could definitely see an improvement as we were applying
    more of the strategies. But ultimately, even in this quite controlled
    setup with very specific prompting and a relatively simple target
    application, we still found issues in the generated code all the time.
    It’s a bit like whac-a-mole: every time you run the workflow, something
    else happens, and you add something else to the prompts or the workflow
    to try and mitigate it.

    These are some of the patterns we saw that are particularly problematic
    for a real-world business application or digital product:

    Overeagerness

    We frequently got additional endpoints and features that we did not
    ask for in the requirements. We even saw it add business logic that we
    didn’t ask for, e.g. when it came across a domain term that it knew how
    to calculate. (“Pro-rated revenue, I know what that is! Let me add the
    calculation for that.”)

    Possible mitigation

    This can be reined in to an extent with the prompts, and by repeatedly
    reminding AI that we ONLY want what is specified. The reviewer agent
    can also help catch some of the excess code (though we’ve seen the
    reviewer delete too much code in its attempt to fix that). But this
    still happened in some shape or form in almost all of our iterations.
    We made one attempt at lowering the temperature to see if that would
    help, but as it was only one attempt in an earlier version of the
    setup, we can’t conclude much from the results.

    Gaps in the requirements will be filled with assumptions

    A priority: String field in an entity was assumed by AI to have the
    value set “1”, “2”, “3”. When we introduced the expansion to more
    fields later, even though we didn’t ask for any changes to the priority
    field, it changed its assumptions to “low”, “medium”, “high”. Apart
    from the fact that it would have been a lot better to introduce an Enum
    here, as long as the assumptions stay in the tests only, this might not
    be a big issue yet. But it could be quite problematic and have a heavy
    impact on a production database if it happened to a default value.
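
    A sketch of what we would rather have seen, with the value set made
    explicit as an enum so there is nothing left for the model to assume
    (names are illustrative):

        import jakarta.persistence.Entity;
        import jakarta.persistence.EnumType;
        import jakarta.persistence.Enumerated;
        import jakarta.persistence.Id;

        // The value set is now explicit in the code, instead of living only
        // in the LLM's shifting assumptions.
        enum Priority { LOW, MEDIUM, HIGH }

        @Entity
        class Task {
            @Id
            Long id;

            @Enumerated(EnumType.STRING) // persisted as "LOW"/"MEDIUM"/"HIGH", not ordinals
            Priority priority;
        }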

    Possible mitigation

    We’d somehow have to make sure that the requirements we give are as
    complete and detailed as possible, and include a value set in this
    case. But historically, we have not been great at that… We have seen AI
    be very helpful in finding gaps in requirements, but the risk of
    incomplete or incoherent requirements always remains. And since the
    goal here was to test the boundaries of AI autonomy, that autonomy is
    definitely limited at this requirements step.

    Brute force fixes

    “[There is a] lazy-loaded relationship that’s causing JSON
    serialization problems. Let me fix this by adding @JsonIgnore to the
    field.” Similar things have also happened to me multiple times in
    agent-assisted coding sessions, from “the build is running out of
    memory, let’s just allocate more memory” to “I can’t get the test to
    work right now, let’s skip it for now and move on to the next task”.
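
    For context, the “fix” in question looks something like this
    (simplified and with made-up names): it makes the serialization error
    go away by silently dropping the relationship from every API response,
    instead of addressing the lazy-loading design.

        import com.fasterxml.jackson.annotation.JsonIgnore;
        import jakarta.persistence.Entity;
        import jakarta.persistence.FetchType;
        import jakarta.persistence.Id;
        import jakarta.persistence.ManyToOne;

        @Entity
        class PurchaseOrder {
            @Id
            Long id;

            // The lazy relationship blew up during JSON serialization, so the
            // agent hid the field from Jackson. The symptom is gone, and so is
            // the customer from every API response.
            @JsonIgnore
            @ManyToOne(fetch = FetchType.LAZY)
            Customer customer;
        }

        @Entity
        class Customer {
            @Id
            Long id;
        }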

    Possible mitigation

    We don’t have any idea how to prevent this.

    Declaring success in spite of red tests

    AI frequently claimed the build and tests were successful and moved
    on to the next step, even though they were not, and even though our
    instructions explicitly stated that the task is not done if build or
    tests are failing.

    Possible mitigation

    This might be easier to fix than the other things mentioned here, with
    a more sophisticated agent workflow setup that has deterministic
    checkpoints and does not allow the workflow to continue unless tests
    are green (a sketch follows below). However, experience from agentic
    workflows in business process automation has already shown that LLMs
    find ways to get around that. In the case of code generation, I would
    imagine they could still delete or skip tests to get beyond that
    checkpoint.
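
    A sketch of such a deterministic checkpoint, with the caveat above that
    it only verifies that the build passes, not that the agent left the
    tests intact:

        import java.io.File;

        // Deterministic checkpoint: run the real build between agent steps and
        // refuse to continue unless it exits cleanly. This cannot tell whether
        // the agent deleted or disabled tests to get past it.
        class BuildCheckpoint {
            static void requireGreenBuild(File projectDir) throws Exception {
                Process build = new ProcessBuilder("./mvnw", "test")
                        .directory(projectDir)
                        .inheritIO()
                        .start();
                if (build.waitFor() != 0) {
                    throw new IllegalStateException("Tests are red; the workflow must not continue");
                }
            }
        }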

    Static code analysis issues

    We ran SonarQube static code analysis on two of the generated
    codebases; here is an excerpt of the issues that were found:

    • “Replace this usage of ‘Stream.collect(Collectors.toList())’ with
      ‘Stream.toList()’ and ensure that the list is unmodified.”
      Severity: Major. Tags: java16.
      From Sonar’s “Why”: The key problem is that .collect(Collectors.toList())
      actually returns a mutable kind of List while in the majority of cases
      unmodifiable lists are preferred.
    • “Merge this if statement with the enclosing one.”
      Severity: Major. Tags: clumsy.
      In general, we saw a lot of ifs and nested ifs in the generated code, in
      particular in mapping and validation code. On a side note, we also saw a
      lot of null checks with `if` instead of the use of `Optional`.
    • “Remove this unused method parameter “event”.”
      Severity: Major. Tags: cert, unused.
      From Sonar’s “Why”: A typical code smell known as unused function
      parameters refers to parameters declared in a function but not used
      anywhere within the function’s body. While this might seem harmless at
      first glance, it can lead to confusion and potential errors in your code.
    • “Complete the task associated to this TODO comment.”
      Severity: Info.
      AI left TODOs in the code, e.g. “// TODO: This would be populated by
      joining with lead entity or separate service calls. For now, we’ll leave
      it null – it can be populated by the service layer”
    • “Define a constant instead of duplicating this literal (…) 10 times.”
      Severity: Critical. Tags: design.
      From Sonar’s “Why”: Duplicated string literals make the process of
      refactoring complex and error-prone, as any change would need to be
      propagated on all occurrences.
    • “Call transactional methods via an injected dependency instead of
      directly via ‘this’.”
      Severity: Critical.
      From Sonar’s “Why”: A method annotated with Spring’s @Async, @Cacheable
      or @Transactional annotations will not work as expected if invoked
      directly from within its class.

    I would argue that all of these issues are relevant observations that
    make maintenance harder and riskier, even in a world where AI does all
    the maintenance.
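
    The last finding in the table is a good example of why this matters;
    the pattern below (made-up names) compiles and may even pass tests, but
    the inner call silently runs without a transaction, because Spring’s
    @Transactional only takes effect when the call goes through the Spring
    proxy:

        import org.springframework.stereotype.Service;
        import org.springframework.transaction.annotation.Transactional;

        @Service
        class ImportService {

            public void importAll(Iterable<String> records) {
                for (String row : records) {
                    importOne(row); // BUG: self-invocation bypasses the proxy,
                                    // so @Transactional is silently ignored
                }
            }

            @Transactional
            public void importOne(String row) {
                // ... writes that are expected to commit or roll back together
            }
        }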

    Possible mitigation

    It is of course possible to add an agent to the workflow that looks at
    the issues and fixes them one by one. However, I know from the real
    world that not all of them are relevant in every context, and teams
    often deliberately mark issues as “won’t fix”. So there is still some
    nuance to this.


