AI Coding Agents Struggle with Precise Code Localization, Study Reveals

A recent international study has highlighted a significant limitation in current AI coding agents: they cannot reliably pinpoint the critical lines of code within the files they identify. While these agents, including tools such as Claude Code and Codex, are adept at finding the correct files associated with a bug, they frequently overlook the specific lines that are essential for a successful fix. This finding, based on the new SWE-Explore benchmark, suggests a hidden weakness in how AI agents approach software debugging.
The conventional evaluation of AI coding agents has primarily focused on whether a bug was fixed, without detailed insight into the repair process. This “black box” approach obscured whether failures stemmed from incorrect file identification or a misunderstanding of the code within the correct file. The SWE-Explore benchmark addresses this by isolating the code search phase, providing a clearer picture of agent performance in identifying relevant code sections.
Key facts
| Item | Detail |
|---|---|
| Benchmark | SWE-Explore |
| Focus | Separates code search from actual bug fix |
| Key Finding | Agents find files, miss critical lines |
| Dataset Size | 848 problems across 203 open-source projects |
Research Methodology
Developed by an international team that includes researchers from Shanghai Jiao Tong University, SWE-Explore evaluates the initial phase of AI debugging. Agents are given a bug description and a software project, then asked to return a ranked list of potentially relevant code sections. To establish ground truth, the researchers drew on multiple successful solutions from advanced models such as GPT-5.4 and Gemini 3 Pro. By analyzing which files and lines these models actually examined while fixing bugs, they identified the passages most likely to provide useful context. The dataset spans 203 open-source projects and ten programming languages, with Python making up the majority of tasks.
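As a rough illustration of how ground truth of this kind could be assembled, the sketch below keeps only the lines that several successful runs examined. The function name `consensus_lines`, the `min_runs` threshold, and the sample file paths are hypothetical; the article does not describe the exact aggregation rule the researchers used.

```python
from collections import Counter

def consensus_lines(trajectories, min_runs=2):
    """Return (file, line) pairs examined by at least `min_runs` successful runs.

    `trajectories` is a list of sets, one per successful run, each holding the
    (file, line) pairs that run looked at. The voting rule is an assumption.
    """
    counts = Counter()
    for viewed in trajectories:
        counts.update(set(viewed))  # count each location once per run
    return {loc for loc, n in counts.items() if n >= min_runs}

# Hypothetical viewing records from three successful repair runs.
runs = [
    {("app/db.py", 10), ("app/db.py", 11), ("app/util.py", 3)},
    {("app/db.py", 10), ("app/db.py", 11)},
    {("app/db.py", 11), ("tests/test_db.py", 40)},
]
core = consensus_lines(runs, min_runs=2)
print(sorted(core))
```

Lines that only one run happened to open are dropped, which is one simple way to separate context that mattered from incidental browsing.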
Performance Gaps
The study compared traditional search methods against five general-purpose coding agents, including Claude Code, Codex, and OpenHands, alongside four specialized code search research systems. Traditional keyword searches performed poorly, while AI agents, which search projects incrementally, were clearly superior. At the file level, agents performed well, accurately ranking and selecting relevant source files. However, performance dropped significantly when the evaluation moved to individual lines of code: general coding agents covered only 14% to 19% of the truly critical lines.
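The gap between file-level and line-level results can be made concrete with a small sketch. The two functions below are simplified stand-ins, not the benchmark's actual metric definitions, and the spans and paths are invented for illustration: an agent returns spans as `(file, start_line, end_line)`, and the ground truth is a set of `(file, line)` pairs.

```python
def file_hit(predicted_spans, core_lines):
    """True if any predicted span names a file that contains a core line."""
    core_files = {path for path, _ in core_lines}
    return any(path in core_files for path, _, _ in predicted_spans)

def line_coverage(predicted_spans, core_lines):
    """Fraction of core (file, line) pairs that fall inside some predicted span."""
    if not core_lines:
        return 0.0
    covered = {
        (path, line)
        for path, line in core_lines
        if any(p == path and start <= line <= end
               for p, start, end in predicted_spans)
    }
    return len(covered) / len(core_lines)

# Hypothetical case: the agent opens the right file but the wrong region.
core_lines = {("app/db.py", n) for n in (10, 11, 42, 43, 44)}
predicted = [("app/db.py", 8, 12), ("app/util.py", 1, 20)]

print(file_hit(predicted, core_lines))       # True
print(line_coverage(predicted, core_lines))  # 0.4
```

Here the agent scores a file hit while covering only two of the five critical lines, which is the pattern the study reports at much larger scale.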
Impact of Language Models and Architectures
Deploying stronger underlying language models did not substantially improve line coverage. The study found that file hit rates remained consistently higher than actual line coverage, regardless of the language model used. The agents tested (Claude Code, Codex, OpenHands, Mini-SWE-Agent, and AweAgent) produced strikingly similar scores across multiple metrics despite their different architectures. An outlier was the CoSIL research system, which achieved higher line coverage by treating code as a network of interconnected blocks.
Implications for Indian Developers and Startups
For Indian developers and AI-driven startups focused on software development and automation, these findings are crucial. While AI coding assistants can accelerate initial debugging steps by identifying relevant files, relying solely on them for precise bug localization might lead to inefficient fixes or missed issues. Companies building or integrating AI into their development workflows should be aware of this limitation. The research suggests that a blended approach, where AI identifies potential areas and human developers perform the final critical line-by-line analysis, might be more effective in the short term. Future improvements in AI agents will need to focus on enhanced contextual understanding and more granular code analysis.
Contextual Clues and Future Directions
An ablation experiment in the study revealed that a minimum threshold of contextual clues is necessary for successful repairs. Success rates for easier tasks jumped significantly only when 50% to 75% of the core regions were visible to the repair model. This indicates that fixes do not improve gradually but require a critical mass of information. The key takeaway for developers and researchers is to “filter less, read more”: AI agents need to process a broader context to identify relevant code lines accurately. This research builds upon previous benchmarks like SWE-bench, which tested AI agents against real GitHub issues, and contributes to a growing understanding of the nuanced challenges in AI-powered software development.
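An ablation of this kind can be sketched as revealing a chosen fraction of the core regions to the repair model and checking how the repair outcome changes. The helper `reveal_core_regions`, its random sampling strategy, and the example regions are assumptions made for illustration; the article does not say how the researchers selected which regions to show.

```python
import math
import random

def reveal_core_regions(core_regions, fraction, seed=0):
    """Pick floor(fraction * n) core regions to show to a repair model.

    Random selection with a fixed seed is an assumption of this sketch.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)
    k = math.floor(fraction * len(core_regions))
    return sorted(rng.sample(list(core_regions), k))

# Hypothetical core regions as (file, start_line, end_line).
regions = [
    ("app/db.py", 10, 14),
    ("app/db.py", 42, 50),
    ("app/models.py", 5, 9),
    ("app/util.py", 1, 3),
]

for fraction in (0.25, 0.5, 0.75, 1.0):
    visible = reveal_core_regions(regions, fraction)
    print(fraction, len(visible), "regions visible")
```

Each visibility level would then be paired with a repair attempt, so that success rates can be compared across fractions rather than only at the extremes.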
Source: The Decoder, https://the-decoder.com/ai-coding-agents-find-the-right-file-but-miss-the-exact-lines-that-matter-study-shows/