In June, the Treasury Department released a request for information (RFI) on the uses, opportunities, and risks of artificial intelligence in the financial services sector. Here’s how they described the purpose of the RFI:
The use of AI is rapidly evolving, and Treasury is committed to continuing to monitor technological developments and their application and potential impacts in financial services to help inform any potential policy deliberations or actions.
In the RFI, Treasury defines artificial intelligence in very broad terms:
A machine-based system that can, for a given set of human-defined objectives, make predictions, recommendations, or decisions influencing real or virtual environments. Artificial intelligence systems use machine- and human-based inputs to perceive real and virtual environments; abstract such perceptions into models through analysis in an automated manner; and use model inference to formulate options for information or action.
However, they make it clear that they are less interested in the historical uses of AI in financial services (which are quite extensive) than they are in its cutting-edge future:
For the purposes of this RFI, Treasury is seeking comment on the latest developments in AI technologies and applications, including but not limited to advancements in existing AI (e.g., machine learning models that learn from data and automatically adapt and improve with minimal human interference, rather than relying on explicit programming) and emerging AI technologies including deep learning neural networks such as generative AI and large language models (LLMs).
The “latest developments in AI technologies and applications.” AKA, the scary stuff.
And what, specifically, are they scared of?
GREAT QUESTION!
There are three main areas.
#1: Explainability and Bias
Regulators have worked with financial services providers for decades on model risk management, and there is a very mature set of principles and best practices for identifying and mitigating risks associated with model development, model use, model validation, ongoing monitoring, outcome analysis, and model governance and controls.
That’s great, but as Treasury wisely notes, these established approaches to model risk management may be insufficient for emerging AI technologies, which operate in fundamentally new and complex ways:
These principles are technology-agnostic but may not be applicable to certain AI models and tools. Due to their inherent complexity, however, AI models and tools may exacerbate certain risks that may warrant further scrutiny and risk mitigation measures. This is particularly true in relation to the use of emerging AI technologies.
Financial services regulators want to better understand how companies will ensure that their AI models are sound, fair, and transparent.
#2: Consumer Protection and Privacy
While Treasury acknowledges in the RFI that AI can lead to a reduction in discrimination and better, fairer outcomes for customers, it’s very clear that they are worried about the potential downsides of greater reliance on AI, specifically in the areas of fair lending and unfair, deceptive, or abusive acts or practices (UDAAP).
Fraud is another concern. Bad actors are already leveraging advanced AI capabilities to scale up their operations and bypass traditional fraud and scam prevention techniques.
And finally, regulators also see the potential for a massive clash between companies using AI (which requires a massive amount of data to train) and customers’ expectations for data privacy and security:
AI models and tools require great amounts of data to train and operate, creating a demand for more or new sources of data … Treasury noted concerns that the use of alternative data could subject growing amounts of behavior to commercial surveillance. In particular, Treasury noted concerns that the use of data regarding individual behavior – even behavior that is not explicitly related to financial products – in AI models that are used to inform decisions to offer financial products and services, such as credit products, could have unintended spillover effects. Additionally, AI-powered predictive analytics are enabling firms to conjecture about the attributes or behavior of an individual based on analysis of data gathered on other individuals. Such capabilities have the potential to undermine privacy (including the privacy of others) and dilute the power of existing “opt-out” privacy protections, especially when a consumer may not be aware of the information being used about them or the way it may be used.
#3: Third-party Risks
Financial institutions are highly reliant on third parties. This is obviously a well-known risk and it has been a recent area of focus for prudential bank regulators (Hey look! BaaS Island!).
However, as with model risk management, Treasury is concerned that existing third-party risk management guidance and best practices may become less useful as AI becomes a more significant part of the third-party service provider landscape.
Specifically, the RFI asks about the challenges of effectively utilizing cutting-edge AI capabilities when those capabilities are only available from a handful of large vendors (OpenAI, Anthropic, Google, Twitter, Meta).
The challenges created by this market concentration are numerous, but they include access (will small financial institutions and fintech startups be able to afford the latest models?), stability (will overuse of a small number of models make the market more volatile?), and antitrust (is it good for a small number of AI companies to dominate the market?).
Three Big Questions
The comment period for the RFI ended in August, and Treasury received more than 100 comments from a wide range of stakeholders, including the CFPB, bank and fintech trade associations, consumer advocates and public policy shops, large tech companies (including Microsoft and Google), and a venture capital firm with an alphabet-inspired moniker.
After reading through Treasury’s RFI as well as many of the comment letters and a few recent speeches from federal bank regulators, I thought it would be useful to summarize three big unanswered (possibly unanswerable) questions about the current and future state of AI in financial services.
To varying extents, all three of these questions revolve around generative AI and large language models (LLMs) because this is the field in artificial intelligence that companies, consumers, and regulators are all (rightly) obsessed with right now.
#1: Will the costs ever come down?
As Ethan Mollick explained in a recent article, the whole game in AI right now is scale:
The larger your model, the more capable it is. Larger models mean they have a greater number of parameters, which are the adjustable values the model uses to make predictions about what to write next. These models are typically trained on larger amounts of data, measured in tokens, which for LLMs are often words or word parts. Training these larger models requires increasing computing power, often measured in FLOPs (Floating Point Operations). FLOPs measure the number of basic mathematical operations (like addition or multiplication) that a computer performs, giving us a way to quantify the computational work done during AI training. More capable models mean that they are better able to perform complex tasks, score better on benchmarks and exams, and generally seem to be “smarter” overall.
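To make the FLOPs framing a bit more concrete, here’s a rough back-of-envelope sketch (mine, not Ethan’s) using the widely cited approximation that training compute is roughly 6 × parameters × training tokens. The model sizes and token counts below are purely illustrative assumptions, not figures for any real model:

```python
# Back-of-envelope training compute, using the common heuristic:
#   training FLOPs ~= 6 * (number of parameters) * (number of training tokens)
# The parameter and token counts below are illustrative assumptions only.

def training_flops(n_parameters: float, n_tokens: float) -> float:
    """Rough estimate of total training FLOPs for a dense transformer."""
    return 6 * n_parameters * n_tokens

examples = {
    "smaller model (7B params, 2T tokens)": training_flops(7e9, 2e12),
    "larger model (1T params, 15T tokens)": training_flops(1e12, 15e12),
}

for name, flops in examples.items():
    print(f"{name}: ~{flops:.1e} FLOPs")

# Scaling either axis by 10x scales the compute bill by 10x, which is why
# "bigger model, trained on more data" translates directly into more cost.
```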
This is really important to understand because, on the surface, it’s not entirely intuitive.
For example, you might think that building a smaller model, trained on a smaller dataset, for a specific set of tasks would be a viable alternative to pouring an insane amount of resources into the development of a larger, more generalized model.
Yeah, not so much:
Bloomberg created BloombergGPT to leverage its vast financial data resources and potentially gain an edge in financial analysis and forecasting. This was a specialized AI whose dataset had large amounts of Bloomberg’s high-quality data, and which was trained on 200 ZettaFLOPs of computing power. It was pretty good at doing things like figuring out the sentiment of financial documents… but it was generally beaten by GPT-4, which was not trained for finance at all. GPT-4 was just a bigger model (the estimates are 100 times bigger, 20 YottaFLOPs) and so it is generally better than small models at everything.
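Those unit prefixes can obscure how big the gap actually is, so here’s a quick sanity check of the “100 times bigger” claim in plain scientific notation (the FLOP figures are Ethan’s estimates, not mine):

```python
# Convert the quoted compute estimates into plain scientific notation.
ZETTA = 1e21   # 10^21
YOTTA = 1e24   # 10^24

bloomberg_gpt_flops = 200 * ZETTA   # 2e23 FLOPs
gpt4_flops = 20 * YOTTA             # 2e25 FLOPs (estimated)

print(f"BloombergGPT: {bloomberg_gpt_flops:.0e} FLOPs")
print(f"GPT-4 (est.): {gpt4_flops:.0e} FLOPs")
print(f"Ratio: {gpt4_flops / bloomberg_gpt_flops:.0f}x")   # -> 100x
```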
So, the incentive is to try to keep your model on the cutting-edge of capability, costs be damned. Here’s Ethan again:
The story of AI capability has largely been a story of increasing model size, and the sizes of the models follow a generational approach. Each generation requires a lot of planning and money to gather the ten times increase in data and computing power needed to train a bigger and better model. We call the largest models at any given time “frontier models.”
So, for simplicity’s sake, let me propose the following very rough labels for the frontier models. Note that these generation labels are my own simplified categorization to help illustrate the progression of model capabilities, not official industry terminology:
- Gen1 Models (2022): These are models with the capability of ChatGPT-3.5, the OpenAI model that kicked off the Generative AI whirlwind. They require less than 10^25 FLOPs of compute and typically cost $10M or under to train. There are many Gen1 models, including open-source versions.
- Gen2 Models (2023-2024): These are models with the capability of GPT-4, the first model of its class. They require roughly between 10^25 and 10^26 FLOPs of compute and might cost $100M or more to train. There are now multiple Gen2 models.
- Gen3 Models (2025?-2026?): As of now, there are no Gen3 models in the wild, but we know that a number of them are planned for release soon, including GPT-5 and Grok 3. They require between 10^26 and 10^27 FLOPs of compute and a billion dollars (or more) to train.
- Gen4 Models, and beyond: We will likely see Gen4 models in a couple of years, and they may cost over $10B to train. Few insiders I have spoken to expect the benefits of scaling to end before Gen4, at a minimum. Beyond that, it may be possible that scaling could increase a full 1,000 times beyond Gen3 by the end of the decade, but it isn’t clear. This is why there is so much discussion about how to get the energy and data needed to power future models.
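To keep those thresholds straight, here’s a small sketch of my own that simply encodes the rough FLOP ranges and price tags from the list above; the boundaries are fuzzy approximations, not official definitions:

```python
# Rough generational buckets from the list above (boundaries are approximate).
GENERATIONS = [
    # (label, min FLOPs, max FLOPs, ballpark training cost)
    ("Gen1", 0,    1e25, "~$10M or less"),
    ("Gen2", 1e25, 1e26, "~$100M or more"),
    ("Gen3", 1e26, 1e27, "~$1B or more"),
    ("Gen4", 1e27, None, "possibly $10B+"),
]

def classify_generation(training_flops: float) -> str:
    """Map an estimated training-compute budget to a rough generation label."""
    for label, lo, hi, cost in GENERATIONS:
        within_upper = hi is None or training_flops < hi
        if training_flops >= lo and within_upper:
            return f"{label} ({cost})"
    return "unknown"

print(classify_generation(2e23))   # Gen1-scale (roughly BloombergGPT territory)
print(classify_generation(2e25))   # Gen2-scale (roughly the GPT-4 estimate)
```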
The fourth generation of models may cost more than $10 billion to train. That’s insane. No one knows where we’re going to get the electricity, computers, or data to train them. But if the scaling laws for model performance hold, there is absolutely an incentive for the largest companies (and governments) to try to build them.
The challenge with all of this, of course, is accessibility.
At the scale we’re talking about, sooner rather than later, no bank will be able to build its own frontier model. Not even JPMorgan Chase. Same for fintech companies. These models will soon be out of reach for almost everyone.
And I’m guessing that the costs to access the frontier models that companies like Google and OpenAI build will also increase significantly, as they attempt to recoup their enormous R&D investments (OpenAI’s full pivot into a for-profit company will likely be an accelerant here).
Does this matter? Is it important for cutting-edge AI capabilities to be accessible for small companies?
Andreessen Horowitz, self-appointed defender of “little tech”, thinks so. From their comment letter to the Treasury Department:
In our own experience, many companies spend more than 80% of their total capital raised on compute resources. AI startup financing rounds are large, and AI valuations have been on the rise — even so, AI firms tend to overwhelmingly spend much of the money they raise on compute. And even after such bloated expenditures, it is very difficult for startups to produce models that could match models put out by much larger firms such as Google or OpenAI.
AI infrastructure is expensive, and looks set to remain that way in the medium term.
A16z’s answer to this problem is open source:
Open source code is essential to the development of AI models and tools. We strongly believe that open source AI should be allowed to freely proliferate and to compete with both big AI companies and startups.
They go on to argue that open-source AI is starting to flourish (and is likely to continue to do so as long as the government doesn’t interfere):
We currently appear to be in the middle of a boom in the adoption of open source AI. This is a surprising reversal of the situation in 2023, when, by our estimate, up to 80-90% of respondents we surveyed indicated a preference for closed source AI. The majority of this share went to OpenAI. Heading into 2024, however, 46% of our survey’s respondents mentioned that they prefer or strongly prefer open source models. In interviews, nearly 60% of AI leaders noted that they were interested in increasing open source usage or switching when fine-tuned open source models roughly matched performance of closed-source models. In 2024 and onwards, we expect a significant shift in enterprises’ usage of AI towards open source. We understand some enterprises are expressly targeting up to 50% usage from open source models — this is up from approximately 20% in 2023.
This reads, to me, as excessively and unrealistically hopeful on the part of a16z.
The reality is that of the five Gen2 frontier models in the market, only one is open source — Llama 3.1, built by Meta.
Meta’s strategy for making its model somewhat open (developers can see the trained model weights, but not the full training data, and they can modify and build on top of it, with some restrictions) is largely about counter-positioning. Meta doesn’t sell cloud computing services and enterprise software like Google and Microsoft (OpenAI’s sugar daddy), so it can reap the benefits of open source (outside developers will help them improve the model) without suffering the drawbacks (the inability to sell it as a bundled service).
However, as noted above, Llama isn’t truly open source (a16z doesn’t even mention Meta in its comment letter) and I think there’s a real question of how many billions of dollars Mark Zuckerberg is willing to invest to keep Llama on the frontier without a more direct way of monetizing it.
Bottom line — there’s a huge incentive for companies to build, use, and monetize the biggest and most capable models, and those models are only going to get more expensive moving forward.
#2: Will we ever be able to explain these models?
Financial services companies need to understand how their AI models work. There are three main reasons for this:
- Soundness — is the model working as designed? Financial services companies (and their regulators) have become accustomed to rule-based systems, in which outcomes are deterministic. In deterministic systems, confirming that everything is working as intended is relatively easy. AI (particularly generative AI) is probabilistic. It’s trying to guess the right answer (and it’s a very good guesser!), but it doesn’t arrive at that answer through explicit, verifiable logic. This makes it much more difficult to confirm that the model will consistently work as designed.
- Fairness — is the model acting on biases that might disadvantage specific groups, especially protected classes under the Equal Credit Opportunity Act (ECOA)? This is a very difficult question to answer because, in the U.S., companies are generally liable not only for intentional discrimination, but also for unintentional discrimination, in which a policy, practice, or procedure that appears neutral on its face has a disproportionate negative impact on a particular group of people based on a protected characteristic, such as race, gender, or national origin. This legal standard is known as disparate impact.
- Transparency — can the reasons for the model’s output be explained in a way that is accurate and easy for a layperson to understand? Answering this question is a legal requirement for lenders in the U.S. when an applicant is denied credit, and the CFPB has recently made it clear that the processes lenders have historically used to comply with this “adverse action” requirement are insufficient, especially in a world that is increasingly governed by AI models.
So, the question is, will emerging AI models (LLMs, in particular) ever be fully explainable?
After reviewing the comment letters submitted to the Treasury Department, most of which were written by organizations with a vested interest in making regulators feel as comfortable with AI as possible, I gotta say that I’m not optimistic.
The thing that makes large language models unique — their broad, generalized intelligence and capabilities — also makes them extraordinarily difficult to constrain, or even to explain.
There are certain techniques for constraining their outputs that have proven effective at mitigating the well-known hallucination problem. One example is Retrieval-Augmented Generation (RAG), the process of optimizing the output of an LLM so that it references an authoritative knowledge base outside of its training data sources before generating a response.
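To make the RAG pattern concrete, here’s a minimal sketch using a toy TF-IDF retriever over a made-up policy knowledge base. The documents and the call_llm stub are placeholders for whatever vector store and model API a real deployment would actually use:

```python
# Minimal Retrieval-Augmented Generation (RAG) sketch: retrieve relevant
# passages from an authoritative knowledge base, then ground the prompt in
# them before calling the model. The knowledge base and call_llm() below
# are illustrative placeholders, not a real institution's systems.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

KNOWLEDGE_BASE = [
    "Adverse action notices must state the principal reasons credit was denied.",
    "Wire transfers over the daily limit require additional verification.",
    "Overdraft fees are waived if the account is brought positive within 24 hours.",
]

vectorizer = TfidfVectorizer()
kb_vectors = vectorizer.fit_transform(KNOWLEDGE_BASE)

def retrieve(question: str, top_k: int = 2) -> list[str]:
    """Return the top_k passages most similar to the question."""
    q_vec = vectorizer.transform([question])
    scores = cosine_similarity(q_vec, kb_vectors)[0]
    ranked = scores.argsort()[::-1][:top_k]
    return [KNOWLEDGE_BASE[i] for i in ranked]

def call_llm(prompt: str) -> str:
    """Placeholder for a real model API call (OpenAI, Anthropic, etc.)."""
    return f"[model response grounded in a prompt of {len(prompt)} chars]"

def answer(question: str) -> str:
    context = "\n".join(retrieve(question))
    prompt = (
        "Answer using ONLY the context below. If the context does not "
        f"contain the answer, say so.\n\nContext:\n{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)

print(answer("Why might a customer receive an adverse action notice?"))
```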
And there are techniques for explaining the outputs of complex AI models, including LLMs, that are becoming increasingly popular. One example is post hoc explainability, an approach that attempts to analyze and interpret the reasoning process of a trained model after it has generated an output, providing insights into how the model arrived at that output.
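As one simple, model-agnostic illustration of the post hoc approach (explaining an LLM’s free-text output is far harder than this), here’s a sketch that trains an opaque model on synthetic data and then probes it with permutation importance to see which inputs it actually leans on:

```python
# Post hoc explainability sketch: train an opaque model, then probe it
# after the fact with permutation importance to see which features drive
# its predictions. The dataset and feature names are purely illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=6, n_informative=3,
                           random_state=0)
feature_names = ["income", "utilization", "delinquencies",
                 "tenure", "inquiries", "balance"]  # illustrative labels only

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature in turn and measure how much accuracy degrades;
# big drops indicate features the model relies on heavily.
result = permutation_importance(model, X_test, y_test, n_repeats=20,
                                random_state=0)

for name, score in sorted(zip(feature_names, result.importances_mean),
                          key=lambda pair: -pair[1]):
    print(f"{name:>14}: {score:.3f}")
```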
These techniques (and others like them) are promising, but they have a long way to go before they can deliver the level of accuracy and consistency that existing regulatory guidance requires.
Google essentially admits this in its comment letter to Treasury:
Due to the dynamic nature of generative AI models and the different options available, reliance on extensive and ongoing testing focused on outcomes throughout the development and implementation stages of such models should often be prioritized relative to explainability in satisfying regulatory expectations of soundness. To that end, the development of technical metrics and related testing benchmarks should be encouraged. Model “explainability,” while useful for purposes of understanding the specific outputs of AI models, may be less effective or insufficient for establishing whether the model as a whole is sound and fit for purpose.
It’s not clear if regulators will buy this argument, at least when it comes to high-stakes, highly-regulated decisions like who to give a loan to, but there is some evidence that regulators and consumer advocates are slowly shifting their attitudes when it comes to certain long-standing beliefs about fairness in financial services.
As mentioned above, disparate impact is a challenging standard to operate under because it doesn’t require any evidence of intent. If seemingly neutral lending policies lead to a disparate negative impact for a protected class, it’s discrimination, and under ECOA you are not permitted to discriminate against any protected class. Period.
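As a deliberately simplified illustration of how a disparate impact screen works in practice, here’s a sketch comparing approval rates across groups. The numbers are made up, and the 80% reference point is borrowed from employment law’s “four-fifths rule”; real fair lending analysis is considerably more involved:

```python
# Toy disparate impact screen: compare approval rates across groups.
# The counts are made up, and the 0.8 threshold is only a rough screen
# (borrowed from the employment-law "four-fifths rule"), not an ECOA rule.

applications = {      # group -> (applications, approvals); illustrative numbers
    "group_a": (1000, 620),
    "group_b": (1000, 430),
}

rates = {g: approved / apps for g, (apps, approved) in applications.items()}
benchmark = max(rates.values())

for group, rate in rates.items():
    adverse_impact_ratio = rate / benchmark
    flag = "review for disparate impact" if adverse_impact_ratio < 0.8 else "ok"
    print(f"{group}: approval rate {rate:.0%}, "
          f"AIR {adverse_impact_ratio:.2f} -> {flag}")
```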
However, in the CFPB’s Fair Lending Report for 2023 (published in June of this year), the bureau introduced a new term when talking about its forward-looking priorities for fair lending supervision and enforcement — less discriminatory alternatives:
The CFPB will also continue to review the fair lending testing regimes of financial institutions. Robust fair lending testing of models should include regular testing for disparate treatment and disparate impact, including searches for and implementation of less discriminatory alternatives using manual or automated techniques.
Less discriminatory alternatives (LDAs for short) represent a potential solution to the never-ending impasse that exists between lenders (who want to optimize their models for predictive performance) and regulators and consumer advocates (who want to prevent discrimination, but who also [I think] recognize how difficult the disparate impact standard is for lenders to deal with).
Here’s how Consumer Reports describes LDAs in its comment letter to Treasury:
Advanced tools and techniques are emerging that enable fine-tuning and debiasing AI/ML models during the development stage to mitigate disparities. Techniques such as adversarial debiasing, joint optimization, or optimized searches for different combinations of variables now enable developers to explore a wide range of alternative models in a much more rapid, efficient manner than was previously feasible. It is now possible to identify alternative models that maintain similar performance levels while minimizing disparity, a win/win for both financial institutions as well as consumers.
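Here’s a highly simplified sketch of what that kind of search might look like: train a handful of candidate models, score each on predictive performance and on the gap in approval rates between two (synthetic) groups, and keep the least disparate candidate that stays within a performance tolerance. Everything here is illustrative; the techniques Consumer Reports mentions (adversarial debiasing, joint optimization) are far more sophisticated:

```python
# Toy "less discriminatory alternatives" (LDA) search: among candidate
# models with similar accuracy, prefer the one with the smallest gap in
# predicted approval rates between two synthetic groups.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=4000, n_features=8, n_informative=4,
                           random_state=0)
# Synthetic group attribute, loosely correlated with one feature.
group = (X[:, 0] + rng.normal(scale=1.5, size=len(X)) > 0).astype(int)

X_tr, X_te, y_tr, y_te, g_tr, g_te = train_test_split(X, y, group,
                                                      random_state=0)

def approval_gap(model, X_eval, groups) -> float:
    """Absolute difference in predicted approval rates between groups."""
    preds = model.predict(X_eval)
    return abs(preds[groups == 0].mean() - preds[groups == 1].mean())

candidates = []
for C in (0.01, 0.1, 1.0, 10.0):             # vary regularization strength
    for drop in (None, 0):                    # optionally drop feature 0
        cols = [i for i in range(X.shape[1]) if i != drop]
        model = LogisticRegression(C=C, max_iter=1000).fit(X_tr[:, cols], y_tr)
        acc = model.score(X_te[:, cols], y_te)
        gap = approval_gap(model, X_te[:, cols], g_te)
        candidates.append((acc, gap, C, drop))

best_acc = max(acc for acc, *_ in candidates)
tolerance = 0.01                              # acceptable performance give-up
viable = [c for c in candidates if c[0] >= best_acc - tolerance]
lda = min(viable, key=lambda c: c[1])         # least disparate viable model

print(f"best accuracy: {best_acc:.3f}")
print(f"chosen LDA: accuracy={lda[0]:.3f}, approval gap={lda[1]:.3f}, "
      f"C={lda[2]}, dropped feature={lda[3]}")
```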
Bottom line — as AI models get larger and more complex, it becomes more and more difficult to hold financial institutions to regulatory expectations around soundness, fairness, and transparency, at least as we have traditionally defined them. This will require us either to update those traditional definitions (i.e., a little disparate impact is OK as long as you are constantly searching for and implementing LDAs) or to disallow financial services providers from using these more advanced models to make high-stakes decisions without a human employee in the loop.
#3: Will we ever allow humans to be out of the loop?
If you’ve been paying attention to the general discussion on AI over the past six months, you likely have noticed a shift in the conversation.
Where once the term “co-pilot” was the favored metaphor for thinking about how to integrate generative AI into business and consumer workflows, today you are much more likely to hear the term “agent”.
What’s the difference between co-pilots and agents?
Michael Hsu, Acting Comptroller of the Currency, gave a speech in June explaining the difference, using the evolution of stock trading as an analogy:
[The] evolution [of stock trading] can be broken down into three phases: (1) inputs, where computers provide information for human traders to consider, (2) co-pilots, where computers support and enable traders to do more faster, and (3) agents, where computers themselves execute trades essentially on behalf of humans according to instructions coded by programmers.
AI appears to be following a similar evolutionary path: where it is used at first to produce inputs to human decision-making, then as a co-pilot to enhance human actions, and finally as an agent executing decisions on its own on behalf of humans.
Autonomous AI agents, which will likely become feasible on a broad scale with the next generation of frontier models (like GPT-5), are an intriguing and terrifying concept in financial services.
On the plus side, autonomous AI agents acting on behalf of customers would empower them to relentlessly optimize their finances without having to lift a finger.
Imagine never having to remember to refinance when interest rates drop or having to put in actual work to find the best rewards credit card for your spending patterns.
On the downside, well, there are lots of potential downsides.
As discussed above, if we can’t ever fully understand how a sufficiently advanced AI model makes a decision, can we really trust it? Especially for high-stakes decisions?
And as I’ve already written about, it will be incredibly difficult for banks to make money if AI agents eliminate customer inertia and laziness. This would seem to be a big safety and soundness risk.
Acting Comptroller Hsu goes even further in his June speech, articulating a plausible worst-case scenario for how a market in which both buyers and sellers are represented by autonomous AI agents could break in catastrophic fashion:
The nightmare paperclip/Skynet scenario for financial stability does not require big leaps of the imagination.
Say an AI agent is programmed to maximize stock returns. It ingests the history of the stock market and identifies a pattern: the most severe stock market crashes are associated with bank runs. Bank runs are associated with high-profile bad news about a bank. Bad news about a bank can be easily spread via viral posts. The AI agent concludes that to maximize stock returns, it should take short positions in a set of banks and spread information to prompt runs and destabilize them.
And finally, as Todd Phillips points out in a new paper, while the term “agent” connotes a legal relationship in which agents have a fiduciary responsibility to act in their principals’ best interests, there is no agreed-upon legal framework for assigning responsibility and liability to autonomous AI agents.
And the early legal test cases that we’ve seen in this area don’t exactly fill my heart with hope:
Because AI agents can act autonomously and take on increasingly sophisticated tasks with ever larger risks to the financial system and consumers, the firms that deploy these agents will be inclined to try to sever their responsibility from the technology they let loose, imposing liability on agents’ users instead. In one of the first cases of its kind to make its way through the court system, Air Canada tried to shield itself from faulty advice its chatbot gave a customer seeking a discount. The airline claimed that their chatbot was a separate legal entity entirely, and that it was ultimately still the customer’s responsibility to locate the correct information elsewhere on the company’s website if its chatbot was incorrect (Moffatt v. Air Canada 2024). A Canadian Tribunal ruled in favor of the customer, but Air Canada’s legal logic is a warning flag for consumers and regulators on how firms are likely to position themselves in similar litigation in the future.
Bottom line — while the promise of autonomous AI agents is enormous, for both consumers and companies, the risks of removing humans from decision-making loops are myriad and very scary. Until regulators have a better handle on how to ensure market stability and legal accountability when it comes to the use of AI agents, my expectation is that they will advise the companies they supervise to move cautiously.