AI Tools May Degrade Documents in Repeated Editing, Microsoft Study Warns


Artificial intelligence (AI) tools are increasingly acting as workplace assistants, helping with tasks such as drafting emails, editing code, and managing complex documents. However, a recent Microsoft Research study warns that over-reliance on these systems could unintentionally reduce the quality of work they are meant to enhance.

The study finds that large language models (LLMs) like ChatGPT and Claude may gradually degrade documents when used for repeated editing tasks. In extended workflows, even advanced models were observed to “corrupt” a significant portion of content, with an average loss or distortion of around 25% after multiple editing cycles, according to the researchers.

The concept of AI delegation—where users assign tasks to AI systems instead of manually editing content—is gaining popularity in modern workplaces. Sometimes referred to as “delegated work” or “vibe coding,” this approach promises greater efficiency but relies heavily on accuracy and consistency.

The researchers caution that this trust may be premature. “Delegation requires trust—the expectation that the LLM will faithfully execute the task without introducing errors,” the paper notes.

To evaluate this, the team introduced a benchmark called DELEGATE-52, testing 19 AI models across 52 professional domains, including coding, accounting, music notation, and textile design. The simulations involved multi-step document editing workflows that reflect real-world usage.

Findings show that errors often begin subtly but accumulate with repeated edits. These “sparse but severe” mistakes can include missing sentences, incorrect numbers, or altered formatting. Over long workflows, average document degradation reached nearly 50% across models.

Even top-performing systems such as Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4 showed around 25% degradation after 20 iterative interactions.
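These figures are consistent with small per-edit errors compounding over a workflow. As an illustration only (the study itself does not describe this model), if we assume each edit independently corrupts a fixed fraction of the document, the reported 25% degradation after 20 edits implies a per-edit error rate of roughly 1.4%, and projecting that same rate out to 50 edits lands near the ~50% figure reported for long workflows:

```python
# Illustrative back-of-the-envelope model, not the study's methodology:
# assume each edit independently corrupts a fixed fraction of the document.

def implied_per_edit_rate(total_degradation: float, n_edits: int) -> float:
    """Per-edit corruption rate r such that (1 - r)**n_edits of the
    document survives, given the observed total degradation."""
    return 1 - (1 - total_degradation) ** (1 / n_edits)

def projected_degradation(rate: float, n_edits: int) -> float:
    """Fraction of the document corrupted after n_edits cycles."""
    return 1 - (1 - rate) ** n_edits

# 25% degradation after 20 edits (the reported top-model figure)
r = implied_per_edit_rate(0.25, 20)
print(f"implied per-edit error rate: {r:.1%}")       # about 1.4%
print(f"projected after 50 edits: {projected_degradation(r, 50):.0%}")
```

Under this simplified assumption, even a seemingly negligible per-edit error rate compounds into substantial losses over a long-horizon workflow, which is why short-term benchmark scores can be misleading.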

A key insight from the study is that strong short-term performance does not guarantee long-term reliability. As tasks become more complex and multi-step, accuracy tends to decline significantly.

Interestingly, giving models access to external tools like code execution or file editors did not improve performance. In some cases, it slightly worsened results due to increased complexity and context overload.

The study also highlights clear differences across task types. Structured tasks such as programming performed relatively well, while natural-language-heavy domains, including financial documents and creative writing, showed higher error rates.

With AI adoption increasing in workplaces, the findings suggest that full automation of document handling remains risky. Researchers emphasize that human supervision is still necessary, especially for high-stakes tasks.

“Current LLMs are unreliable delegates,” the study concludes, noting that while AI systems are improving rapidly, they are not yet dependable enough for fully autonomous long-horizon workflows.

