This is very neat work! I'll be interested to see how they make this sort of thing available to the public, but it's clear from some of the results they mention that search + LLM is one path to producing net-new knowledge from AI systems.
> Here, the code between <<<<<<< SEARCH and ======= is the exact segment to match in the current program version. The code between ======= and >>>>>>> REPLACE is the new segment that will replace the original one. This allows for targeted updates to specific parts of the code.
Does anybody know how they can guarantee uniqueness of the searched snippet within the code block, or is that even possible?
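For what it's worth, here is a minimal sketch (not the paper's actual implementation) of how such a block could be applied while rejecting ambiguous matches; apply_diff_block and its error handling are my own assumptions for illustration:

    def apply_diff_block(program: str, search: str, replace: str) -> str:
        # Count occurrences of the SEARCH segment in the current program.
        count = program.count(search)
        if count == 0:
            raise ValueError("SEARCH segment not found in current program")
        if count > 1:
            # Ambiguous edit: one option is to ask the generator to include
            # more surrounding context until the segment becomes unique.
            raise ValueError(f"SEARCH segment occurs {count} times; edit is ambiguous")
        return program.replace(search, replace, 1)

    # usage sketch
    prog = "def f(x):\n    return x * x\n"
    print(apply_diff_block(prog, "return x * x", "return x ** 2"))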
Finally—something directly relevant to my research (https://trishullab.github.io/lasr-web/).
Below are my take‑aways from the blog post, plus a little “reading between the lines.”
- One lesson DeepMind drew from AlphaCode, AlphaTensor, and AlphaChip is that large‑scale pre‑training, combined with carefully chosen inductive biases, enables models to solve specialized problems at—or above—human performance.
- These systems still require curated datasets and experts who can hand‑design task‑specific pipelines.
- In broad terms, FunSearch (and AlphaEvolve) follow three core design principles:
- Off‑the‑shelf LLMs can both generate code and recall domain knowledge. The “knowledge retrieval” stage may hallucinate, but—because the knowledge is expressed as code—we can execute it and validate the result against a custom evaluation function.
- Gradient descent is not an option for discrete code; a zeroth‑order optimizer—specifically evolutionary search—is required.
- During evolution we bias toward (1) _succinct_ programs and (2) _novel_ programs. Succinctness is approximated by program length; novelty is encouraged via a MAP‑Elites–style "novelty bias," yielding a three‑dimensional Pareto frontier whose axes are _performance, simplicity,_ and _novelty_ (see e.g. OE‑Dreamer: https://claireaoi.github.io/OE-Dreamer/).
Pros
- Any general‑purpose foundation model can be coupled with evolutionary search.
- A domain expert merely supplies a Python evaluation function (with a docstring explaining domain‑specific details). Most scientists I've talked with - astronomers, seismologists, neuroscientists, etc. - already maintain such evaluation functions for their own code.
- The output is an interpretable program; even if it overfits or ignores a corner case, it often provides valuable insight into the regimes where it succeeds.
Cons
- Evolutionary search is compute‑heavy and LLM calls are slow unless heavily optimized. In my projects we need ≈ 60 k LLM calls per iteration to support a reasonable number of islands and populations. In equation discovery we offset cost by making ~99 % of mutations purely random; every extra 1 % of LLM‑generated mutations yields roughly a 10 % increase in high‑performing programs across the population.
- Evaluation functions typically undergo many refinement cycles; without careful curation the search may converge to a useless program that exploits loopholes in the metric.
Additional heuristics make the search practical. If your evaluator is slow, overlap it with LLM calls. To foster diversity, try dissimilar training: run models trained on different data subsets and let them compete. Interestingly, a smaller model (e.g., Llama-3 8 B) often outperforms a larger one (Llama‑3 70 B) simply because it emits shorter programs.
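To make the design principles above concrete, here is a deliberately bare-bones sketch of the FunSearch/AlphaEvolve-style loop; llm_mutate and evaluate are placeholders standing in for the model call and the domain evaluation function, not anyone's actual API:

    import random

    def evolve(seed_program, llm_mutate, evaluate, iterations=1000, pop_size=50):
        # Population of (score, program) pairs; higher score is better.
        population = [(evaluate(seed_program), seed_program)]
        for _ in range(iterations):
            # Tournament selection: sample a few candidates, keep the best as parent.
            parent = max(random.sample(population, min(3, len(population))))[1]
            child = llm_mutate(parent)       # prompt an LLM with the parent as context
            try:
                score = evaluate(child)      # domain-specific evaluation function
            except Exception:
                continue                     # programs that crash are simply discarded
            population.append((score, child))
            population = sorted(population, reverse=True)[:pop_size]
        return population[0]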
Interesting that this wasn't tested on ARC-AGI. Francois has always said he believed program search of this type was the key to solving it. It seems like potentially this approach could do very well.
I'm surprised I'm not able to find this out - can someone tell me whether AlphaEvolve involves backprop or not?
I honestly have no idea how AlphaEvolve works - does it work purely at the text level? Meaning I might be able to come up with something like AlphaEvolve with some EC2s and Gemini API access?
Interestingly, they improved matrix multiplication, and there was a paper on arXiv a few days ago [1] that also improved matrix multiplication. The only case common to both is <4,5,6> (multiplying a 4x5 matrix by a 5x6 matrix), and both improved it from 93 to 90 multiplications.
There’s been a ton of work on multiplying very large matrices. But actually, I have no idea—how well explored is the space of multiplying small matrices? I guess I assume that, like, 4x4 is done very well, and everything else is kind of… roll the dice.
Not really, only when looking back at the 60's and 70's when most of the important algorithms I use were invented. For example, LR parsing and A*.
Just wait until the MBAs and politicians learn about this Adam Smith guy. A pipedream now, but maybe in the future schools will be inspired to teach about dialectical reasoning and rediscover Socrates.
[end of snark]
Sorry, I'm getting tired of ad-fueled corporations trying to get me to outsource critical thinking.
AI will indeed kill the leetcode interview - because once it replaces human SWEs you don't really need to give leetcode-style brainteasers to any human anymore.
Not defending it, but I think it was more a test of whether you have the dedication (and general smarts) to grind them for a few months than of software engineering skills.
Similar to hiring good students from famous universities even if most of CS isn't that applicable to day to day programming work, just because it's a signal that they're smart and managed to get through a difficult course.
Are you sure? From my experience, no AI assistant is fast enough to handle fast-paced questions on the code the candidate just wrote. Also, frequent requests to adjust variable names and deactivating pasting on the page make it extremely laborious for the candidate to get AI to modify the code on the screen.
I find it quite profound that there is no mention of generating the corresponding code documentation. Without design diagrams, source and commit comments, etc., the resulting code and changes will become incomprehensible and unmaintainable. Unless that is somehow the point?
Software engineering will be completely solved. Even systems like v0 are astounding in their ability to generate code, and they are very primitive compared to what's coming. I get downvoted on HN for this opinion, but it's truly going to happen. Any system that can produce code, test the code, and iterate if needed will eventually outperform humans. Add in reinforcement learning, where they can run the code and train the model when it gets code generation right, and we are on our way to a whole different world.
"Coding" might be solved, but there is more to software engineering than just churning out code - i.e. what should we build? What are the requirements? Are they right? What other dependencies do we want to use - AWS or GCP, for example? Why those and not others - what's the reason? How does this impact our users and how they use the system? What level of backwards/forwards compatibility do we want? How do we handle reliability? Failover? Backups? And so on and so on.
Some of these questions change slightly, since we might end up with "unlimited resources" (i.e. instead of having, e.g., 5 engineers on a team who can only get X done per sprint, we have near-limitless compute to throw at the problem), so maybe the answer to the "what should we prioritize" type questions becomes "build everything on the wish-list in 1 day"?
Interesting times.
My gut is that software engineers will end up as glorified test engineers, coming up with test cases (even if not actually writing the code) and asking the AI to write code until it passes.
Testing in general is quickly being outmoded by formal verification. From my own gut, I see software engineering pivoting into consulting—wherein the deliverables are something akin to domain-specific languages that are tailored to a client's business needs.
Generally the product decisions are not given to the engineers. But yeah, engineers will be tuning, prodding, and poking AI systems to generate the code to match the business requirements.
It is not that you get downvoted because they don't understand you; it is because you sell your opinion as fact, like an apostle. For example, what does it mean that software engineering is solved?
Prophets are always beaten by average citizens, because prophecy is always unpleasant. It can't be otherwise. At the same time, you can't tell right away whether a person is really a prophet, because it becomes known much later. That's probably why beating them (the simplest solution) turns out to be the most observed.
What about brownfield development though? What about vague requirements or cases with multiple potential paths or cases where some technical choices might have important business consequences that shareholders might need to know about? Can we please stop pretending that software engineering happens in a vacuum?
> What about vague requirements or cases with multiple potential paths or cases where some technical choices might have important business consequences that shareholders might need to know about?
If the cost of developing the software is 0, you can just build both.
The thing with vague requirements is that the real problem is that making decisions is hard. There are always tradeoffs and consequences. Rarely is there a truly clear and objective decision. In the end either you or the LLM are guessing what the best option is.
There's cope in the comments about the possibility of some software-adjacent jobs remaining, which is possible, but the idea of a large number of high-paying software jobs remaining by 2030 is a fantasy. Time to learn to be a plumber.
Maybe this one can stop writing a fucking essay in code comments.
I'm now no longer surprised by just how consistently all the Gemini models overcomplicate coding challenges or just plain get them wrong.
Claude is just consistently spot on: a few salient comments for tricky code, instead of incessantly telling me what it's changed and what I might want to do, making incorrect assumptions even when it has the code or we've already discussed the point, or changing large amounts of unrelated code (e.g. styles). I could go on.
Shame I'm too tight to pay for Claude RN though...
From the paper, "Notably, for multiplying two 4 × 4 matrices, applying the algorithm of Strassen recursively results in an algorithm with 49 multiplications, which works over any field...AlphaEvolve is the first method to find an algorithm to multiply two 4 × 4 complex-valued matrices using 48 multiplications."
If you do naive matrix multiplication, you get a sense that you're doing similar work multiple times, but it's hard to quantify just what that duplicated work entails. Compare it to, for example, calculating the size of the union of two sets:
Total size = size(A) + size(B) - size(intersection(A, B))
You have to take out that extra intersection amount because you've counted it twice. What if you could avoid counting it twice in the first place? That's easy: you just iterate over each set once, keeping track of the elements you've already seen.
Strassen's algorithm keeps track of calculations that are needed later on. It's all reminiscent of dynamic programming.
What I find interesting is that it seems the extra savings require complex values. There must be something going on in the complex plane that is again over-counted by the naive approach.
By googling "4x4 matrices multiplication 48" I ended up on this discussion on math.stackexchange https://math.stackexchange.com/questions/578342/number-of-el... , where in 2019 someone stated "It is possible to multiply two 4×4 matrix A,B with only 48 multiplications.", with a link to a PhD thesis. This might mean that the result was already known (I still have to check the outline of the algorithm).
It seems like you have some misconceptions about Strassen's alg:
1. It is a standard example of the divide and conquer approach to algorithm design, not the dynamic programming approach. (I'm not even sure how you'd squint at it to convert it into a dynamic programming problem.)
2. Strassen's algorithm does not require complex-valued matrices; everything can be done over the real numbers (a sketch of the standard real-valued construction follows below).
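For reference, here is the standard real-valued divide-and-conquer construction, sketched on 2x2 blocks with NumPy; applying it recursively to 4x4 matrices gives the 49 multiplications mentioned in the paper (7 block products, each itself done with 7 scalar multiplications):

    import numpy as np

    def strassen_one_level(A, B):
        # One level of Strassen: 7 block multiplications instead of 8.
        # Works over any field; here the blocks happen to be real-valued.
        n = A.shape[0] // 2
        A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
        B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]
        M1 = (A11 + A22) @ (B11 + B22)
        M2 = (A21 + A22) @ B11
        M3 = A11 @ (B12 - B22)
        M4 = A22 @ (B21 - B11)
        M5 = (A11 + A12) @ B22
        M6 = (A21 - A11) @ (B11 + B12)
        M7 = (A12 - A22) @ (B21 + B22)
        C11 = M1 + M4 - M5 + M7
        C12 = M3 + M5
        C21 = M2 + M4
        C22 = M1 - M2 + M3 + M6
        return np.block([[C11, C12], [C21, C22]])

    A, B = np.random.rand(4, 4), np.random.rand(4, 4)
    assert np.allclose(strassen_one_level(A, B), A @ B)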
I think the original poster was referring to the AlphaEvolve variant of Strassen's, not the standard Strassen (with respect to complex values).
> AlphaEvolve achieved up to a 32.5% speedup for the FlashAttention kernel implementation in Transformer-based AI models
> In roughly 75% of cases, it rediscovered state-of-the-art solutions, to the best of our knowledge.
> And in 20% of cases, AlphaEvolve improved the previously best known solutions
These sound like incredible results. I'd be curious what kind of improvements were made and what they actually look like.
Like, was that "up to a 32.5% speedup" on some weird edge case, with a negligible speedup otherwise? Would love to see the benchmarks.
Remember that GPUs have cache hierarchies, and matching block sizes to hit those caches optimally is a big win that you often don't get by default, just because the number of important kernels, times important GPUs, times the effort to properly tune one, is greater than what people are willing to do for others for free in open source. Not to mention kernel fusion and API boundaries that socially force suboptimal choices for the sake of clarity and simplicity.
It's a very impressive result: not magic, but also not cheating!
Absolutely, I'm not arguing that the results are unreasonable to the point of illegitimacy. I'm just curious to see when they perform as well as reported, how well the presented solutions generalize to different test cases, and whether it's routing to different solutions based on certain criteria, etc.
100%. LLMs are extremely useful for doing obvious but repetitive optimizations that a human might miss.
What it essentially does is a debugging/optimization loop where you change one thing, evaluate, compare results, and repeat.
Previously we needed a human in the loop to make the change. Of course we have automated hyperparameter tuning (and similar things), but that only works in a rigidly defined search space.
Will we see LLMs generating new improved LLM architectures, now fully incomprehensible to humans?
If I understood correctly, isn't this software only as useful as the LLM powering it? It sounds like something very useful, but either I'm missing something or it just puts a "please optimize this code" prompt into a loop with a validator. Useful, but maybe not as revolutionary as the underlying LLM tech itself.
Edit: the white paper says this: "AlphaEvolve employs an ensemble of large language models. Specifically, we utilize a combination of Gemini 2.0 Flash and Gemini 2.0 Pro. This ensemble approach allows us to balance computational throughput with the quality of generated solutions. Gemini 2.0 Flash, with its lower latency, enables a higher rate of candidate generation, increasing the number of ideas explored per unit of time. Concurrently, Gemini 2.0 Pro, possessing greater capabilities, provides occasional, higher-quality suggestions that can significantly advance the evolutionary search and potentially lead to breakthroughs. This strategic mix optimizes the overall discovery process by maximizing the volume of evaluated ideas while retaining the potential for substantial improvements driven by the more powerful model."
So I remain of my earlier opinion. Furthermore, in the paper they don't present it as something extraordinary, as some people here say it is, but as an evolution of existing software, FunSearch.
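As a rough illustration of the ensemble idea in the quoted passage (my own sketch, not anything from the paper): most candidate proposals come from the cheap low-latency model, with an occasional call to the stronger one. flash_generate and pro_generate are hypothetical stand-ins for the actual model calls:

    import random

    def propose_candidate(prompt, flash_generate, pro_generate, pro_fraction=0.1):
        # Mostly cheap, fast proposals to maximize the number of ideas tried;
        # occasionally a slower, higher-quality proposal from the stronger model.
        if random.random() < pro_fraction:
            return pro_generate(prompt)
        return flash_generate(prompt)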
> AlphaEvolve is accelerating AI performance and research velocity. By finding smarter ways to divide a large matrix multiplication operation into more manageable subproblems, it sped up this vital kernel in Gemini’s architecture by 23%, leading to a 1% reduction in Gemini's training time.
I'm starting to think numbers like this are really just slop lately.
FlashAttention achieving a 32.5% speedup? Cool.
Why not submit it as a PR to the FlashAttention repo then? Can I read about it in more detail?
I have not read this linked article, but your comment made me recall a discussion about a speed up of CUDA kernels presented by Sakana AI Labs. The researcher Ravid Shwartz Ziv at NYU posted about it on LinkedIn [1], and here is the Twitter post of interest [2]
""" Yesterday's news about Sakana AI Labs provided an important lesson for all of us working with AI agents. Their announcement of an AI system that could supposedly optimize CUDA kernels to run 100x faster initially seemed like exactly the kind of use cases we've been hoping for in AI-assisted development.
Like many others, I was excited about it. After all, isn't this exactly what we want AI to do - help us optimize and improve our technical systems?
However, careful investigation by the community (on Twitter) revealed a different story. What really happened? The AI-generated CUDA kernel appeared to achieve incredible speedups, but the code was inadvertently reusing memory buffers containing previous results, essentially bypassing the actual computation. When properly evaluated, the kernel actually runs about 3x slower than the baseline. """
[1] https://www.linkedin.com/posts/ravid-shwartz-ziv-8bb18761_ye...
[2] https://x.com/main_horse/status/1892473238036631908
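The failure mode described above is easy to guard against. Here is a sketch (my own, with assumed helper names) of the kind of harness that catches it: fresh random inputs every trial, and a correctness check against a reference implementation before any timing is trusted:

    import time
    import numpy as np

    def guarded_benchmark(candidate, reference, make_inputs, trials=10, tol=1e-4):
        # Validate on fresh buffers first, so a kernel that silently reuses
        # stale outputs (instead of computing them) fails immediately.
        for _ in range(trials):
            inputs = make_inputs()                 # new random buffers every trial
            if not np.allclose(candidate(*inputs), reference(*inputs), atol=tol):
                raise AssertionError("candidate output does not match reference")
        start = time.perf_counter()
        for _ in range(trials):
            candidate(*make_inputs())
        return (time.perf_counter() - start) / trials

    # usage sketch:
    # guarded_benchmark(my_fast_matmul, lambda a, b: a @ b,
    #                   lambda: (np.random.rand(512, 512), np.random.rand(512, 512)))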
lmao this is exactly the kind of stuff I always see from Claude. It’s like adding a Skip() to a test and declaring it works now. “Well it’s a lot faster, I met the criteria of my TODOs cya”
I’ve seen it so much I kinda doubt it was “inadvertent” because they’re like seemingly intentional about their laziness, and will gaslight you about it too.
Same thing for TypeScript type errors… “AI added as any and the problem is fixed”!
“I am a vibe coder, it is your job to check the results”
I assume the Gemini results are JAX/PAX-ML/Pallas improvements for TPUs, so I would look there for recent PRs.
This is great.
But how incremental are these advancements?
I picked one at random (B.2 -- the second autocorrelation inequality). Then, I looked up the paper that produced the previous state of the art (https://arxiv.org/pdf/0907.1379). It turns out that the authors had themselves found the upper bound by performing a numerical search using "Mathematica 6" (p.4). Not only did the authors consider this as a secondary contribution (p.2), but they also argued that finding something better was very doable, but not worth the pain:
"We remark that all this could be done rigorously, but one needs to control the error arising from the discretization, and the sheer documentation of it is simply not worth the effort, in view of the minimal gain." (p.5)
So at least in this case it looks like the advancement produced by AlphaEvolve was quite incremental (still cool!).
Merely from your telling, it seems it is no longer "not worth the effort", as "the effort" has been reduced drastically. This is itself significant.
That's right, and in fact it's the core purpose of the tool.
This is complex automation, which by definition compresses the solution into a computable process that works more efficiently than the non-automated process.
That, in fact, is the revolutionary part - you’re changing how energy is used to solve the problem.
This is exactly why I think the concerns about AI taking people's jobs are overblown. There is not a limited amount of knowledge work to do or things that can be invented or discovered. There's just work that isn't worth the effort, time or money to do right now. That doesn't mean it's not valuable; it's just not cost-effective. If you reduce effort, time and money, then suddenly you can do it.
Like even just for programming. I just had an AI instrument my app for tracing, something I wanted to do for a while, but I didn't know how to do and didn't feel like figuring out how to do it. That's not work we were likely to hire someone to do or that would ever get done if the AI wasn't there. It's a small thing, but small things add up.
Not worth the time for a human, but if you can throw AI at all of those "opportunities" it adds up substantially, because all the chores can be automated.
If this is not the beginning of the take off I don’t know what is.
Interestingly, it seems AlphaEvolve has already been in use for a year and is just now being publicly shown. The paper also mentions that it uses Gemini 2.0 (Pro and Flash), which creates a situation where Gemini 2.0 was used, in a way, to train Gemini 2.5.
I don't know if I would call this the fabled "self-improving feedback loop", but it seems to have some degree of it. It also raises the question of whether AlphaEvolve was being developed for a year or has been in production for a year. By now it makes sense to hold back on sharing what AI research gems you have discovered.
If you have the brain power, the compute and control the hardware, what is there to prevent the take off feedback loop? Deepmind is at this point in the timeline uniquely positioned.
> If you have the brain power, the compute and control the hardware, what is there to prevent the take off feedback loop?
In the specific context of improving our AI hardware, for example, it's not as simple as coming up with a good idea -- hardware companies hire thousands of people to improve their designs. Prototypes need to be implemented, verified, quantified, compared thoroughly with the alternatives, then the idea is approved for production, which again leads to a cascade of implementation, verification, etc. until they can reach consumers. In order to make these improvements reach the consumer significantly faster you need to accelerate all of the steps of the very simplified pipeline mentioned earlier.
More generally, an argument can be made that we have been in that take off feedback loop for hundreds of years; it's just that the rate of improvement hasn't been as spectacular as we may have hoped for because each incremental step simply isn't that big of a deal and it takes quite a bit of time to reach the next one.
Running out of improvements after the first pass would prevent that. Who is to say this AlphaEvolve is not already obsolete, having already served its purpose?
Not to sound metaphysical or anything, but dependency on artificial intelligence seems to be something you would find at the peak of Mount Stupid (where the Darwin Awards are kept).
I am late for a chess game, l8r sk8rs.
The fact that all computational problems have a best-case complexity bound, and that there are generally diminishing marginal returns as algorithms approach that bound (i.e. the lower-hanging fruit is found first). E.g. no amount of intelligence is going to find an algorithm that can sort an array of any arbitrary Comparable type on a single CPU thread faster than O(n*log(n)). There's room for improvement in better adapting algorithms to cache hierarchies etc., but there's only a fixed amount of improvement that can be gained from that.
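(For the sorting example, the bound comes from a counting argument: a comparison-based sort must distinguish all n! input orderings, and each comparison yields at most one bit of information, so in the worst case it needs at least

    log2(n!) = n*log2(n) - n*log2(e) + O(log n) = Θ(n log n)

comparisons, by Stirling's approximation. No amount of search can beat that.)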
It is really about autonomy. Can it make changes to itself without human review? If it does, what is the proof such changes won't just stop at some point? All I am seeing here is a coder-assist tool, and I'm unsure how helpful inexplicable solutions are in the long run. It could result in an obtuse code base. Is that the point?
Cool, but don't get me wrong, isn't this essentially similar to Google's Co-Scientist, where multiple models are in a loop, passing context back and forth validating things? At its core, it's still a system of LLMs, which is impressive in execution but not fundamentally new.
LLMs are undoubtedly useful at tasks like code "optimisation" and detecting patterns or redundancies that humans might overlook, but this announcement feels like another polished, hypey blog post from Google.
What's also becoming increasingly confusing is their use of the "Alpha" branding. Originally, it was for breakthroughs like AlphaGo or AlphaFold, where there was a clear leap in performance and methodology. Now it's being applied to systems that, while sophisticated, don't really rise to the same level of impact.
edit: I missed the evaluator in my description, but an evaluation method is applied also in Co-Scientist:
"The AI co-scientist leverages test-time compute scaling to iteratively reason, evolve, and improve outputs. Key reasoning steps include self-play–based scientific debate for novel hypothesis generation, ranking tournaments for hypothesis comparison, and an "evolution" process for quality improvement."[0]
[0]: https://research.google/blog/accelerating-scientific-breakth...
Few things are more Google than having two distinct teams building two distinct products that are essentially the same thing.
You can contrast that with Microsoft, where the same team is building the same product with two distinct names.
this is the same team and it's pretty obvious they would apply the same ideas to two different problems that can both benefit from it no?
So we are rebranding the same idea every four months and call it a breakthrough?
No, you're extending the domain to which it is applicable. It's like noting that vaccines are useful for smallpox -- and the flu! Same idea, but different recipes.
Pardon, "Google's Co-Scientist"? There are multiple projects called that?
Yep
https://research.google/blog/accelerating-scientific-breakth...
https://engineering.cmu.edu/news-events/news/2023/12/20-ai-c...
For the people awaiting the singularity, lines like this read as if written almost straight out of science fiction:
> By suggesting modifications in the standard language of chip designers, AlphaEvolve promotes a collaborative approach between AI and hardware engineers to accelerate the design of future specialized chips."
This just means that it operates on the (debug text form of the) intermediate representation of a compiler.
Not necessarily. Theorem provers provide goals that can serve the same function as "debug text." Instead of interpreting the natural language chosen by the dev who wrote the compiler, these goals provide concrete, type-accurate statements that indicate the progress of an ongoing proof.
I'm referring to what the authors actually claim they did in the paper. They operated on XLA-generated textual IR.
Cf. the second paragraph of 3.3.4 of https://storage.googleapis.com/deepmind-media/DeepMind.com/B...
Sure, but remember that this approach only works for exploring an optimization of a function which has a well-defined evaluation metric.
You can't write an evaluation function for general "intelligence"...
Honestly it's this line that did it for me:
> AlphaEvolve enhanced the efficiency of Google's data centers, chip design and AI training processes — *including training the large language models underlying AlphaEvolve itself*.
Singularity people have been talking for decades about AI improving itself better than humans could, and how that results in runaway compounding growth of superintelligence, and now it's here.
Most code optimizations end up looking somewhat asymptotic towards a non-zero minimum.
If it takes you a week to find a 1% speedup, and the next 0.7% speedup takes you 2 weeks to find ... well, by using the 1% speedup the next one only takes you 13.86 days. This kind of small optimization doesn't lead to exponential gains.
That doesn't mean it's not worthwhile - it's great to save power & money and reduce iteration time by a small amount. And it combines with other optimizations over time. But this is in no way an example of the kind of thing that the singularity folks envisioned, regardless of the realism of their vision or not.
The singularity has always existed. It is located at the summit of Mount Stupid, where the Darwin Awards are kept. AI is really just psuedo-intelligence; an automated chairlift to peak overconfidence.
I love these confident claims! It sounds like you really know what you are talking about. It's either that or you are projecting. Could you elaborate? I for one find the level of intelligence quite real. I use AIs to do a lot of quite complex stuff for me nowadays. I have an agent that keeps my calendar, schedules appointments with people who want meetings with me, summarizes emails and adds these summaries to Notion and breaks them up into to-do lists, answers questions about libraries and APIs, and writes most of my code (although I do need to hold its hand, and it cannot improve by learning from me).
I'm surprised by how little detail is given about the evolution procedure:
>In AlphaEvolve, the evolutionary database implements an algorithm that is inspired by a combination of the MAP elites algorithm [71] and island-based population models [80, 94].
"inspired by" is doing a lot of heavy lifting in this sentence. How do you choose dimensions of variation to do MAP-elites? How do you combine these two algorithms? How loose is the inspiration? It feels like a lot of the secret sauce is in the answers to these questions, and we get a single paragraph on how the evolution procedure works, which is so vague as to tell us almost nothing.
Calling it now - RL finally "just works" for any domain where answers are easily verifiable. Verifiability was always a prerequisite, but the difference from prior generations (not just AlphaGo, but any nontrivial RL process prior to roughly mid-2024) is that the reasoning traces and/or intermediate steps can be open-ended with potentially infinite branching, no clear notion of "steps" or nodes and edges in the game tree, and a wide range of equally valid solutions. As long as the quality of the end result can be evaluated cleanly, LLM-based RL is good to go.
As a corollary, once you add in self-play with random variation, the synthetic data problem is solved for coding, math, and some classes of scientific reasoning. No more mode collapse, no more massive teams of PhDs needed for human labeling, as long as you have a reliable metric for answer quality.
This isn't just neat, it's important - as we run out of useful human-generated data, RL scaling is the best candidate to take over where pretraining left off.
I skimmed the paper quickly. This does not look like RL; it's a genetic algorithm. In a previous life I was working in compbio (protein structure prediction), and we built hundreds of such heuristic-based algorithms (Monte Carlo simulated annealing, GAs, ...). The moment you have a good energy function (one that provides some sort of gradient) and a fast enough sampling function (LLMs), you can do lots of cool optimization with sufficient compute.
I guess that's now becoming true with LLMs.
Faster LLMs -> More intelligence
> This does not look like RL. It's a genetic algorithm.
couldn't you say that if you squint hard enough, GA looks like a category of RL? There are certainly a lot of similarities, the main difference being how each new population of solutions is generated. Would not at all be surprised that they're using a GA/RL hybrid.
Genetic algorithms are worse than gradient descent.
If variety is sought, why not use beam search with a nice population statistic?
This depends quite a bit on what you’re trying to optimize.
Gradient descent is literally following the negative of the gradient to minimize a function. It requires a continuous domain, either analytical or numerical derivatives of the cost function, and has well-known issues in narrow valleys and other complex landscapes.
It’s also a local minimization technique and cannot escape local minima by itself.
_Stochastic_ gradient descent and related techniques can overcome some of these difficulties, but are still more or less local minimization techniques and require differentiable and continuous scoring functions.
In contrast, genetic algorithms try to find global minima, do not require differentiable scoring functions, and can operate on both continuous and discrete domains. They have their own disadvantages.
Different techniques for different problems. The field of numerical optimization is vast and ancient for a reason.
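To make the contrast concrete, here is a toy side-by-side sketch (nothing AlphaEvolve-specific): gradient descent on a smooth 1-D objective, which needs a derivative, versus a mutation-only evolutionary loop on a discrete bitstring, where no derivative exists:

    import random

    # Gradient descent: needs a derivative, follows it downhill.
    def gd_minimize(df, x0, lr=0.1, steps=100):
        x = x0
        for _ in range(steps):
            x -= lr * df(x)
        return x

    # Toy evolutionary search: no derivative, just score-and-mutate on bits.
    def ga_maximize(score, length=20, pop=30, gens=100):
        population = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop)]
        for _ in range(gens):
            population.sort(key=score, reverse=True)
            parents = population[: pop // 2]
            children = []
            for p in parents:
                child = p[:]
                child[random.randrange(length)] ^= 1   # flip one random bit
                children.append(child)
            population = parents + children
        return max(population, key=score)

    x_min = gd_minimize(df=lambda x: 2 * (x - 3), x0=0.0)   # minimizes (x-3)^2
    best = ga_maximize(score=sum)                            # maximizes number of 1s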
You also need a base model that can satisfy the verifier at least some of the time. If all attempts fail, there's nothing there to reinforce. The reinforcement-learning algorithms themselves haven't changed much, but LLMs got good enough on many problems that RL could be applied. So for any given class of problem you still need enough human data to get initial performance better than random.
IMO RL can only solve "easy" problems. The reason RL works now is that unsupervised learning is a general recipe for transforming hard problems into easy ones. But it can't go all the way to solutions, you need RL on top for that. Yann LeCun's "cherry on top" analogy was right.
There's no API or product yet, so it seems unlikely that they made it to a "just works" level of polish?
They are having some success in making it work internally. Maybe only the team that built it can get it to work? But it does seem promising.
Are there platforms that make such training more streamlined? Say I have some definition of success for a given problem and its data; how do I go about generating said RL model as fast and easily as possible?
We're working on an OSS industrial-grade version of this at TensorZero but there's a long way to go. I think the easiest out of the box solution today is probably OpenAI RFT but that's a partial solve with substantial vendor lock-in.
This technique doesn't actually use RL at all! There’s no policy-gradient training, value function, or self-play RL loop like in AlphaZero/AlphaTensor/AlphaDev.
As far as I can read, the weights of the LLM are not modified. They do some kind of candidate selection via evolutionary algorithms for the LLM prompt, which the LLM then remixes. This process then iterates like a typical evolutionary algorithm.
This isn't quite RL, right...? It's an evolutionary approach on specifically labeled sections of code optimizing towards a set of metrics defined by evaluation functions written by a human.
I suppose you could consider that last part (optimizing some metric) "RL".
However, it's missing a key concept of RL which is the exploration/exploitation tradeoff.
I think you mean the general class of algorithms that scale with compute, RL being the chief example. But yes, I agree with that point.
Most things are verifiable, just not with code. I'm not particularly excited for a world where everything is predictable. This is coming from a guy who loves forecasting/prediction modeling too, but one thing I hate about prediction modeling, especially from a hobbyist standpoint, is data. It's very hard to get useful data. Investors will literally buy into hospital groups to get medical data, for example.
There are monopolies on the coolest sets of data in almost all industries, all the RL in the world won't do us any good if those companies doing the data hoarding are only using it to forecast outcomes that will make them more money, not what can be done to better society.
Yup. It's coming. Any verifiable human skill will be done by AI.
I wonder if evolvable hardware [0] is the next step. In 1996, they optimized an FPGA using a genetic algorithm. It evolved gates that were disconnected from the rest of the circuit yet were still required: the circuit seemed to use the minuscule magnetic fields from these disconnected gates, relying on the physical substrate rather than logical connections.
[0] https://en.wikipedia.org/wiki/Evolvable_hardware
The paper does not give that many details about the evolution part. Normally, evolutionary algorithms contain some cross-over component where solutions can breed with each other. Otherwise it's better classified as hill climbing / beam search.
There's also 'evolutionary strategy' algorithms that do not use the typical mutation and crossover, but instead use a population of candidates (search samples) to basically approximate the gradient landscape.
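A minimal sketch of that idea, in the spirit of OpenAI-style evolution strategies; es_step and score are names I've made up, and all constants and the toy objective are arbitrary:

    import numpy as np

    def es_step(theta, score, sigma=0.1, lr=0.02, pop_size=64):
        # One evolution-strategies update: perturb the parameters with Gaussian noise,
        # score every perturbed candidate, and recombine the noise weighted by the
        # (normalized) scores. The weighted sum acts as a search-gradient estimate;
        # no derivative of `score` is ever taken, so it can be non-differentiable.
        noise = np.random.randn(pop_size, theta.size)
        rewards = np.array([score(theta + sigma * n) for n in noise])
        rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
        return theta + lr * (noise.T @ rewards) / (pop_size * sigma)

    # Usage: maximize -||x - 3||^2 starting from the origin; theta drifts toward 3.
    theta = np.zeros(5)
    for _ in range(500):
        theta = es_step(theta, lambda t: -np.sum((t - 3.0) ** 2))
    print(theta)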
I fear it’s not really evolutionary algorithms in the typical sense.
Does this remind anyone else of genetic algorithms?
Is this basically a merge of LLM's with genetic algorithm iteration?
It seemed appropriate to use Gemini to make sure my answers were ideal for getting access to the preview.
Interesting to see Terence Tao in the authors list. I guess he's fully ai pilled now. Did he check the math results?
He is not in the author list, just acknowledged by the authors.
AlphaEvolve is confirming evidence of an intelligence explosion.
The key ingredient for an intelligence explosion is AI accelerating development of AI.
This is it. It’s happening.
I just hope there’s enough time between an actual AI and the “Let’s butcher this to pump out ads” version to publish a definitive version of Wikipedia. After a few days with Gemini delving into the guts of a spectrum analyser I’m very impressed by the capabilities. But my cynicism gland is fed by the nature of this everything-as-a-service. To run an LLM on your computer, locally, without internet, is just a few clicks. But that’s not the direction these software behemoths are going.
That's possibly a bit too general and an overstatement...
Remember, this approach only works for optimizing an already-defined behavior of a function that has a correspondingly well-defined evaluation metric.
You can't write an evaluation function for each individual piece of or general "intelligence"...
We are entering a new era of evolutionary algorithms and LLMs. Reminds me of the idea behind: https://github.com/DivergentAI/dreamGPT
That’s a really cool idea. I often used https://dannymator.itch.io/randomicon to come up with novel ideas, never thought of feeding random words to an LLM as a way of doing it.
Maybe the actual solution to the interpretability/black-box problem is to not ask the LLM to execute a given task, but rather to have it write deterministic programs that can execute the task.
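As a hypothetical illustration of the difference (llm here is a canned stand-in, not a real API, and the spam rule is invented so the sketch actually runs):

    def llm(prompt: str) -> str:
        # Stand-in for any chat-completion call; returns a canned rule-based
        # classifier purely so this example is runnable.
        return "def is_spam(email):\n    return 'free money' in email.lower()"

    email_text = "Claim your FREE MONEY now!!!"

    # Black-box style (for contrast): the model performs the task directly, and
    # every future decision is another opaque call.
    #   verdict = llm(f"Is this email spam? Answer yes/no:\n{email_text}")

    # Program-synthesis style: the model writes a deterministic classifier once;
    # the resulting code can be read, tested, versioned, and run offline.
    source = llm("Write a Python function is_spam(email) using explicit, commented rules.")
    namespace = {}
    exec(source, namespace)                  # in practice: review / sandbox first
    print(namespace["is_spam"](email_text))  # True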
This is very neat work! Will be interested in how they make this sort of thing available to the public but it is clear from some of the results they mention that search + LLM is one path to the production of net-new knowledge from AI systems.
> Here, the code between <<<<<<< SEARCH and ======= is the exact segment to match in the current program version. The code between ======= and >>>>>>> REPLACE is the new segment that will replace the original one. This allows for targeted updates to specific parts of the code.
Does anybody know how they can guarantee uniqueness of the searched snippet within the code block, or whether that's even possible?
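Based on the quoted description, such a block might look something like the sketch below (the Python inside is invented for illustration). How duplicate matches would be disambiguated isn't covered by the quoted text.

    <<<<<<< SEARCH
    def total_cost(weights, values):
        total = 0
        for i in range(len(weights)):
            total += weights[i] * values[i]
        return total
    =======
    def total_cost(weights, values):
        return sum(w * v for w, v in zip(weights, values))
    >>>>>>> REPLACE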
Too bad the code isn't published. I would expect everything from DeepMind to be open source, except the model itself.
In the past AI wasn't really competing with other AI for user dollars. It was more just a bolted on "feature".
Nowadays it makes much more sense to share less.
Finally—something directly relevant to my research (https://trishullab.github.io/lasr-web/). Below are my take‑aways from the blog post, plus a little “reading between the lines.”
- One lesson DeepMind drew from AlphaCode, AlphaTensor, and AlphaChip is that large‑scale pre‑training, combined with carefully chosen inductive biases, enables models to solve specialized problems at—or above—human performance.
- These systems still require curated datasets and experts who can hand‑design task‑specific pipelines.
- Conceptually, this work is an improved version of FunSearch (https://github.com/google-deepmind/funsearch/).
- In broad terms, FunSearch (and AlphaEvolve) follow three core design principles:
Pros
- Any general‑purpose foundation model can be coupled with evolutionary search.
- A domain expert merely supplies a Python evaluation function (with a docstring explaining domain‑specific details; a minimal sketch appears after this list). Most scientists I've talked with - astronomers, seismologists, neuroscientists, etc. - already maintain such evaluation functions for their own code.
- The output is an interpretable program; even if it overfits or ignores a corner case, it often provides valuable insight into the regimes where it succeeds.
Cons
- Evolutionary search is compute‑heavy and LLM calls are slow unless heavily optimized. In my projects we need ≈ 60 k LLM calls per iteration to support a reasonable number of islands and populations. In equation discovery we offset cost by making ~99 % of mutations purely random; every extra 1 % of LLM‑generated mutations yields roughly a 10 % increase in high‑performing programs across the population.
- Evaluation functions typically undergo many refinement cycles; without careful curation the search may converge to a useless program that exploits loopholes in the metric.
Additional heuristics make the search practical. If your evaluator is slow, overlap it with LLM calls. To foster diversity, try dissimilar training: run models trained on different data subsets and let them compete. Interestingly, a smaller model (e.g., Llama-3 8B) often outperforms a larger one (Llama-3 70B) simply because it emits shorter programs.
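For concreteness, a minimal sketch of the kind of evaluation function referred to above, for a toy 1-D equation-discovery task; the dataset, metric, and length penalty are placeholders, not anything from the paper:

    import numpy as np

    def evaluate(program: str) -> float:
        """Score a candidate program for a toy 1-D equation-discovery task.

        The candidate must define predict(x) mapping a numpy array to predictions.
        Higher scores are better.
        """
        x = np.linspace(0, 10, 200)
        y_true = 3.0 * np.sin(x) + 0.5 * x          # made-up "ground truth"
        namespace = {"np": np}
        try:
            exec(program, namespace)                 # in practice: sandboxed
            y_pred = namespace["predict"](x)
            mse = float(np.mean((y_pred - y_true) ** 2))
        except Exception:
            return float("-inf")                     # broken candidates score worst
        return -mse - 0.001 * len(program)           # mild pressure toward short programs

    # Usage: evaluate("def predict(x):\n    return 3 * np.sin(x) + 0.5 * x") is close to 0.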
Interesting that this wasn't tested on ARC-AGI. Francois has always said he believed program search of this type was the key to solving it. It seems like potentially this approach could do very well.
My thought as well. How well does it translate to ARC-AGI? If it does well then we have a general-purpose superintelligence… so maybe AGI?
I'm surprised I'm not able to find this out - can some one tell me whether AlphaEvolve involves backprop or not?
I honestly have no idea how AlphaEvolve works - does it work purely on the text level? Meaning I might be able to come up with something like AlphaEvolve with some EC2's and a Gemini API access?
No, the program and prompt databases use a genetic algorithm.
So with just a server and Gemini access + their code I could achieve the same thing? Nice
Interestingly, they improved matrix multiplication, and there was a paper on arXiv a few days ago [1] that also improved matrix multiplication; the only case common to both is <4,5,6> (multiplying a 4x5 matrix with a 5x6 matrix), and both improved it from 93 to 90 multiplications.
[1]: https://arxiv.org/html/2505.05896v1
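For context on the numbers: the <4,5,6> counts refer to scalar multiplications (the rank of the bilinear algorithm), and the schoolbook method for a 4x5 times 5x6 product uses 4*5*6 = 120 of them; a trivial check:

    def naive_mult_count(m: int, k: int, n: int) -> int:
        # Scalar multiplications used by the schoolbook algorithm for (m x k) @ (k x n).
        return m * k * n

    # The <4,5,6> case discussed above: schoolbook needs 120 scalar multiplications,
    # the previously best known algorithm used 93, and both new results get 90.
    print(naive_mult_count(4, 5, 6))   # 120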
There’s been a ton of work on multiplying very large matrices. But actually, I have no idea—how well explored is the space of multiplying small matrices? I guess I assume that, like, 4x4 is done very well, and everything else is kind of… roll the dice.
Has scifi covered anything after AI? Or do we just feed the beast with Dyson spheres and this is the end point of the intelligent universe?
Good method to generate synthetic training data, but only works for domains where validation can be scaled up.
anyone else feel out-evolved yet?
Not really, only when looking back at the 60's and 70's when most of the important algorithms I use were invented. For example, LR parsing and A*.
Just wait until the MBAs and politicians learn about this Adam Smith guy. A pipedream now, but maybe in the future schools will be inspired to teach about dialectical reasoning and rediscover Socrates.
[end of snark]
Sorry, I'm getting tired of ad-fueled corporations trying to get me to outsource critical thinking.
Would love for AI to kill the leetcode interview
AI will indeed kill the leetcode interview - because once it replaces human SWEs you don't really need to give leetcode-style brainteasers to any human anymore.
You never needed to.
Not defending it, but I think it was more a test of whether you have the dedication (and general smarts) to grind them for a few months than of software engineering skills.
Similar to hiring good students from famous universities even if most of CS isn't that applicable to day to day programming work, just because it's a signal that they're smart and managed to get through a difficult course.
Yes, and you will never need to.
https://www.interviewcoder.co/ already served that.
Are you sure? From my experience, no AI assistant is fast enough to handle fast-paced questions on the code the candidate just wrote. Also, frequent requests to adjust variable names and deactivating pasting on the page make it extremely laborious for the candidate to get AI to modify the code on the screen.
It will just move the leetcode interview to in-person.
... and make credentials more important. Be careful what you ask for.
that was already solved 2 years back.
I find it quite profound that there is no mention of generating corresponding code documentation. Without design diagrams, source and commit comments, etc., the resulting code and changes will become incomprehensible and unmaintainable. Unless that is somehow the point?
https://www.forbes.com/sites/jackkelly/2024/05/31/google-ai-...
2024 was a long time ago.
Software engineering will be completely solved. Even systems like v0 are astounding in their ability to generate code, and they are very primitive compared to what's coming. I get downvoted on HN for this opinion, but it's truly going to happen. Any system that can produce code, test the code, and iterate if needed will eventually outperform humans. Add in reinforcement learning, where they can run the code and train the model when it gets code generation right, and we are on our way to a whole different world.
"Coding" might be solved, but there is more to software engineering than just churning out code - i.e. what should we build? What are the requirements? Are they right? Whats the other dependencies we want to use - AWS or GCP for example? Why those and not others - whats the reason? How does this impact our users and how they use the system? What level of backwards/forwards compatibility do we want? How do we handle reliability? Failover? Backups? and so on and so on.
Some of these questions change slightly, since we might end up with "unlimited resources" (i.e. instead of having e.g. 5 engineers on a team who can only get X done per sprint, we effectively have near-limitless compute to use instead) so maybe the answer is "build everything on the wish-list in 1 day" to the "what should we prioritize" type questions?
Interesting times.
My gut is that software engineers will end up as glorified test engineers, coming up with test cases (even if not actually writing the code) and asking the AI to write code until it passes.
Testing in general is quickly being outmoded by formal verification. From my own gut, I see software engineering pivoting into consulting—wherein the deliverables are something akin to domain-specific languages that are tailored to a client's business needs.
Indeed, reasoning in the small and reasoning in the large are different skills. Architecture abstracts over code.
Generally the product decisions are not given to the engineers. But yeah, engineers will be tuning, prodding, and poking ai systems to generate the code to match the business requirements.
Everyone will just turn into a problem solver until there are no more problems.
It is not that you get downvoted because they don’t understand you; it is because you sell your opinion as fact, like an apostle. For example, what does it mean that software engineering is "solved"?
Check his profile.
> about: I believe in the creation of a machine god
Sounds about right.
I wonder if he’s a machine himself?
Prophets are always beaten by average citizens, because prophecy is always unpleasant. It can't be otherwise. At the same time, you can't tell right away whether someone is really a prophet, because that only becomes known much later. That's probably why beating them (the simplest solution) turns out to be the most common response.
> because prophecy is always unpleasant.
Not necessarily. 'Gospel' is translated as good news. The unpleasant news tends towards those within the power structure that the prophet challenges.
> it is because you sell your opinion as fact.
The guy's making a prediction. Classifying it as some kind of religious zealotry isn't fair to his point or him.
[dead]
It's a known idiom; it means the optimal algorithm has been found, as in "tic-tac-toe is a solved problem".
If I squint I can see some connection between Go (game) and (Software) Engineering (field).
> Any system that can produce code, test the code, and iterate if needed
That isn't every problem in software engineering.
What about brownfield development though? What about vague requirements or cases with multiple potential paths or cases where some technical choices might have important business consequences that shareholders might need to know about? Can we please stop pretending that software engineering happens in a vacuum?
> What about vague requirements or cases with multiple potential paths or cases where some technical choices might have important business consequences that shareholders might need to know about?
If the cost of developing the software is 0, you can just build both.
The thing with vague requirements is that the real problem is that making decisions is hard. There are always tradeoffs and consequences. Rarely is there a truly clear and objective decision. In the end either you or the LLM are guessing what the best option is.
Isn't what you describe eventually just a context contraint problem?
There's cope in the comments about the possibility of some software-adjacent jobs remaining, which is possible, but the idea of a large number of high-paying software jobs remaining by 2030 is a fantasy. Time to learn to be a plumber.
A huge percentage of all venture capital in the United States is moving toward solving this problem.
Maybe this one can stop writing a fucking essay in code comments.
I'm no longer surprised at just how consistently all the Gemini models overcomplicate coding challenges or just plain get them wrong.
Claude is just consistently spot on: a few salient comments for tricky code, instead of incessantly telling me what it's changed and what I might want to do next, making incorrect assumptions when it already has the code or it's something we've discussed, or changing large amounts of unrelated code (e.g. styles). I could go on.
Shame I'm too tight to pay for Claude RN though...
Just ask it to only add comments on complex parts (or not at all). Prompt engineering.
The comment spam is likely a byproduct of RL; it lets the model dump locally relevant reasoning while writing code.
You can try asking it to not do that, but I would bet it would slightly degrade code quality.
The model likely is doing it more for itself than for you.
You can take the code and give it to another LLM instance and ask it to strip all comments.