I don’t have an idealistic view of software; I understand perfectly well that software is buggy as hell. And I think that retrying tests is a practice that massively contributes to this bugginess.
I can understand the issue with 3–4 external services chained together, although I would still solve it with test waits or asynchronous assertions rather than relying on the assumption that everything happens instantly (it never does in production). But dropping that 50+ million users figure seems irrelevant to the E2E tests or their stability, and the same goes for microservice/microfrontend architecture. It might as well be monolithic; why do your E2E tests care? On the contrary, I would say that retrying tests in a microservice architecture is an even greater sin, since data races and concurrency issues occur there more often.
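To make concrete what I mean by an asynchronous assertion: instead of asserting immediately (flaky) or retrying the whole test (hides bugs), poll for the expected state with an explicit timeout. This is a minimal sketch; the `wait_until` helper, the timings, and the `results` pipeline are all invented for illustration:

```python
import threading
import time

def wait_until(condition, timeout=5.0, interval=0.1):
    """Poll `condition` until it returns truthy or the timeout expires.

    Unlike a blind retry of the whole test, this tolerates only the
    latency we explicitly budget for, and still fails loudly on real bugs.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return
        time.sleep(interval)
    raise AssertionError(f"condition not met within {timeout}s")

# Toy stand-in for a chain of services: the result lands ~200ms after kickoff.
results = []
threading.Timer(0.2, lambda: results.append("order-confirmed")).start()

# Assert the state *eventually* appears, within an explicit budget.
wait_until(lambda: "order-confirmed" in results, timeout=2.0)
print(results)
```

Most E2E frameworks ship something like this already (auto-retrying assertions), so there’s rarely a need to hand-roll it; the point is that the waiting is scoped to one assertion with a stated budget, not smeared over the whole test via retries.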
I assume you’re not talking about E2E tests that are part of the merge pipeline (which is a great practice if they’re fast enough), but rather ones executed on a schedule. That somewhat mitigates the harm of test retries, but still doesn’t fully justify them. My main issue with test retries is that they hide more issues than they solve. Think of the number of stability issues in your own code that you’re constantly missing by retrying tests. An actual user is not an E2E test: they might or might not retry your faulty software, and they could just as well move on to another vendor.