Machine Learning driven Database Optimization with Luigi Nardi from DBtune (interview)
Jun 27th, 2024 | 25 min read
Introduction
This interview is part of the simplyblock Cloud Commute Podcast, available on Youtube , Spotify , iTunes/Apple Podcasts , and our show site .
In this insightful video, we explore the cutting-edge field of machine learning-driven database optimization with Luigi Nardi In this episode of the Cloud Commute podcast.
Key Takeaways
Q: Can machine learning improve database performance? Yes, machine learning can significantly improve database performance. DBtune uses machine learning algorithms to automate the tuning of database parameters, such as CPU, RAM, and disk usage. This not only enhances the efficiency of query execution but also reduces the need for manual intervention, allowing database administrators to focus on more critical tasks. The result is a more responsive and cost-effective database system.
Q: How do machine learning models predict query performance in databases? DBtune employs probabilistic models to predict query performance. These models analyze various metrics, such as CPU usage, memory allocation, and disk activity, to forecast how queries will perform under different conditions. The system then provides recommendations to optimize these parameters, ensuring that the database operates at peak efficiency. This predictive capability is crucial for maintaining performance in dynamic environments.
Q: What are the main challenges in integrating AI-driven optimization with legacy database systems? Integrating AI-driven optimization into legacy systems presents several challenges. Compatibility issues are a primary concern, as older systems may not easily support modern optimization techniques. Additionally, there’s the need to gather sufficient data to train machine learning models effectively. Luigi also mentions the importance of addressing security concerns, especially when sensitive data is involved, and ensuring that the integration process does not disrupt existing workflows.
Q: Can you provide examples of successful AI-driven query optimization in real-world applications? DBtune has successfully applied its technology across various database systems, including Postgres, MySQL, and SAP HANA. For instance, in a project with a major telecom company, DBtune’s optimization algorithms reduced query execution times by up to 80%, leading to significant cost savings and improved system responsiveness. These real-world applications demonstrate the practical benefits of AI-driven query optimization in diverse environments.
In addition to highlighting the key takeaways, it’s essential to provide deeper context and insights that enrich the listener’s understanding of the episode. By offering this added layer of information, we ensure that when you tune in, you’ll have a clearer grasp of the nuances behind the discussion. This approach enhances your engagement with the content and helps shed light on the reasoning and perspective behind the thoughtful questions posed by our host, Chris Engelbert. Ultimately, this allows for a more immersive and insightful listening experience.
Key Learnings
Q: Can machine learning be used for optimization?
Yes, machine learning can be highly effective in optimizing complex systems by analyzing large datasets and identifying patterns that might not be apparent through traditional methods. It can automatically adjust system configurations, predict resource needs, and streamline operations to enhance performance.
simplyblock Insight: While simplyblock does not directly use machine learning for optimization, it provides advanced infrastructure solutions that are designed to seamlessly integrate with AI-driven tools. This allows organizations to leverage machine learning capabilities within a robust and flexible environment, ensuring that their optimization processes are supported by reliable and scalable infrastructure. Q: How does AI-driven query optimization improve database performance?
AI-driven query optimization improves database performance by analyzing system metrics in real-time and adjusting configurations to enhance data processing speed and efficiency. This leads to faster query execution and better resource utilization.
simplyblock Insight: simplyblock’s platform enhances database performance through efficient storage management and high availability features. By ensuring that storage is optimized and consistently available, simplyblock allows databases to maintain high performance levels, even as AI-driven processes place increasing demands on the system. Q: What are the main challenges in integrating AI-driven optimization with legacy database systems?
Integrating AI-driven optimization with legacy systems can be challenging due to compatibility issues, the complexity of existing configurations, and the risk of disrupting current operations.
simplyblock Insight: simplyblock addresses these challenges by offering flexible deployment options that are compatible with legacy systems. Whether through hyper-converged or disaggregated setups, simplyblock enables seamless integration with existing infrastructure, minimizing the risk of disruption and ensuring that AI-driven optimizations can be effectively implemented. Q: What is the relationship between machine learning and databases?
The relationship between machine learning and databases is integral, as machine learning algorithms rely on large datasets stored in databases to train and improve, while databases benefit from machine learning’s ability to optimize their performance and efficiency.
simplyblock Insight: simplyblock enhances this relationship by providing a scalable and reliable infrastructure that supports large datasets and high-performance demands. This allows databases to efficiently manage the data required for machine learning, ensuring that the training and inference processes are both fast and reliable.
Additional Nugget of Information
Q: How is the rise of vector databases impacting the future of machine learning and databases? The rise of vector databases is revolutionizing how large language models and AI systems operate by enabling more efficient storage and retrieval of vector embeddings. These databases, such as pgvector for Postgres, are becoming essential as AI applications demand more from traditional databases. The trend indicates a future where databases are increasingly specialized to handle the unique demands of AI, which could lead to even greater integration between machine learning and database management systems. This development is likely to play a crucial role in the ongoing evolution of both AI and database technologies.
Conclusion
Luigi Nardi showcases how machine learning is transforming database optimization. As DBtune’s founder, he highlights the power of AI to boost performance, cut costs, and enhance sustainability in database management. The discussion also touches on emerging trends like vector databases and DBaaS, making it a must-listen for anyone keen on the future of database technology. Stay tuned for more videos on cutting-edge technologies and their applications.
Full Episode Transcript
Chris Engelbert: Hello, everyone. Welcome back to this week’s episode of simplyblock’s Cloud Commute podcast. This week I have Luigi with me. Luigi, obviously, from Italy. I don’t think he has anything to do with Super Mario, but he can tell us about that himself. So welcome, Luigi. Sorry for the really bad joke.
Luigi Nardi: Glad to be here, Chris.
Chris Engelbert: So maybe you start with introducing yourself. Who are you? We already know where you’re from, but I’m not sure if you’re actually residing in Italy. So maybe just tell us a little bit about you.
Luigi Nardi: Sure. Yes, I’m originally Italian. I left the country to explore and study abroad a little while ago. So in 2006, I moved to France and studied there for a little while. I spent almost seven years in total in France eventually. I did my PhD program there in Paris and worked in a company as a software engineer as well. Then I moved to the UK for a few years, did a postdoc at Imperial College London in downtown London, and then moved to the US. So I lived in California, Palo Alto more precisely, for a few years. Then in 2019, I came back to Europe and established my residency in Sweden.
Chris Engelbert: Right. Okay. So you’re in Sweden right now.
Luigi Nardi: That’s correct.
Chris Engelbert: Oh, nice. Nice. How’s the weather? Is it still cold?
Luigi Nardi: It’s great. Everybody thinks that Sweden has very bad weather, but Sweden is a very, very long country. So if you reside in the south, actually, the weather is pretty decent. It doesn’t snow very much.
Chris Engelbert: That is very true. I actually love Stockholm, a very beautiful city. All right. One thing you haven’t mentioned, you’re actually the founder and CEO of DBtune. So you left out the best part guess. Maybe tell us a little bit about DBtune now.
Luigi Nardi: Sure. DBtune is a company that is about four years old now. It’s a spinoff from Stanford University and the commercialization of about a decade of research and development in academia. We were working on the intersection between machine learning and computer systems, specifically the use of machine learning to optimize computer systems. This is an area that in around 2018 or 2019 received a new name, which is MLSys, machine learning and systems. This new area is quite prominent these days, and you can do very beautiful things with the combination of these two pieces. DBtune is specifically focusing on using machine learning to optimize computer systems, particularly in the computer system area. We are optimizing databases, the database management systems more specifically. The idea is that you can automate the process of tuning databases. We are focusing on the optimization of the parameters of the database management systems, the parameters that govern the runtime system. This means the way the disk, the RAM, and the CPU interact with each other. You take the von Neumann model and try to make it as efficient as possible through optimizing the parameters that govern that interaction. By doing that, you automate the process, which means that database engineers and database administrators can focus on other tasks that are equally important or even more important. At the same time, you get great performance, you can reduce your cloud costs as well. If you’re running in the cloud in an efficient way, you can optimize the cloud costs. Additionally, you get a check on your greenops, meaning the sustainability aspect of it. So this is one of the examples I really like of how you can be an engineer and provide quite a big contribution in terms of sustainability as well because you can connect these two things by making your software run more efficiently and then scaling down your operations.
Chris Engelbert: That is true. And it’s, yeah, I’ve never thought about that, but sure. I mean, if I get my queries to run more efficient and use less compute time and compute power, huh, that is actually a good thing. Now I’m feeling much better.
Luigi Nardi: I’m feeling much better too. Since we started talking a little bit more about this, we have a blog post that will be released pretty soon about this very specific topic. I think this connection between making software run efficiently and the downstream effects of that efficiency, both on your cost, infrastructure cost, but also on the efficiency of your operations. It’s often underestimated, I would say.
Chris Engelbert: Yeah, that’s fair. It would be nice if you, when it’s published, just send me over the link and I’m putting it into the show notes because I think that will be really interesting to a lot of people. As he said specifically for developers that would otherwise have a hard time having anything in terms of sustainability. You mentioned database systems, but I think DBtune specifically is focused on Postgres, isn’t it?
Luigi Nardi: Right. Today we are focusing on Postgres. As a proof of concept, though, we have applied similar technology to five different database management systems, including relational and non-relational systems as well. So we were, a little while ago, we wanted to show that this technology can be used across the board. And so we play around with MySQL, with FoundationDB, which is the system behind iCloud, for example, and many of the VMware products. And then we have RocksDB, which is behind your Instagram and Facebook and so on. Facebook, very pressing that open source storage system. And things like SAP HANA as well, we’ve been focusing on that a little bit as well, just as a proof of concept to show that basically the same methodology can apply to very different database management systems in general.
Chris Engelbert: Right. You want to look into Oracle and take a chunk of their money, I guess. But you’re on the right track with SAP HANA. It’s kind of on the same level. So how does that work? I think you have to have some kind of an agent inside of your database. For Postgres, you’re probably using the stats tables, but I guess you’re doing more, right?
Luigi Nardi: Right. This is the idea of, you know, observability and monitoring companies. They mainly focus on gathering all this metrics from the machine and then getting you a very nice visualization on your dashboard. As a user, you would look at these metrics and how they evolve over time, and then they help you guide the next step, which is some sort of manual optimization of your system. We are moving one step forward and we’re trying to use those metrics automatically instead of just giving them back to the user. So we move from a passive monitoring approach to an active approach where the metrics are collected and then the algorithm will help you also to automatically change the configuration of the system in a way that it gets faster over time. And so the metrics that we look at usually are, well, the algorithm itself will gather a number of metrics to help it to improve over time. And this type of metrics are related to, you know, your system usage, you know, CPU memory and disk usage. And other things, for example, latency and throughput as well from your Postgres database management system. So using things like pg_stat_statements, for example, for people that are a little more familiar with Postgres. And by design, we refrain from looking inside your tables or looking specifically at your metadata, at your queries, for example, we refrain from that because it’s easier to basically, you know, deploy our system in a way that it’s not dangerous for your data and for your privacy concerns and things like that.
Chris Engelbert: Right. Okay. And then you send that to a cloud instance that visualizes the data, just the simple stuff, but there’s also machine learning that actually looks at all the collected data and I guess try to find pattern. And how does that work? I mean, you probably have a version of the query parser, the Postgres query parser in the backend to actually make sense of this information, see what the execution plan would be. That is just me guessing. I don’t want to spoil your product.
Luigi Nardi: No, that’s okay. So the agent is open source and it gets installed on your environment. And anyone fluent in Python can read that in probably 20 minutes. So it’s pretty, it’s not massive. It’s not very big. That’s what gets connected with our backend system, which is running in our cloud. And the two things connect and communicate back and forth. The agent reports the metrics and requests what’s the next recommendation from the optimizer that runs in our backend. The optimizer responds with a recommendation, which is then enabled in the system through the agent. And then the agent also starts to measure what’s going on on the machine before reporting these metrics back to the backend. And so this is a feedback loop and the optimizer gets better and better at predicting what’s going on on the other side. So this is based on machine learning technology and specifically probabilistic models, which I think is the interesting part here. By using probabilistic models, the system is able to predict the performance for a new guess, but also predict the uncertainty around that estimate. And that’s, I think, very powerful to be able to combine some sort of prediction, but also how confident you are with respect to that prediction. And those things are important because when you’re optimizing a computer system, of course, you’re running this in production and you want to make sure that this stays safe for the system that is running. You’re changing the system in real time. So you want to make sure that these things are done in a safe way. And these models are built in a way that they can take into account all these unpredictable things that may otherwise book in the engineer system.
Chris Engelbert: Right. And you mentioned earlier that you’re looking at the pg_stat_statements table, can’t come up with the name right now. But that means you’re not looking at the actual data. So the data is secure and it’s not going to be sent to your backend, which I think could be a valid fear from a lot of people like, okay, what is actually being sent, right?
Luigi Nardi: Exactly. So Chris, when we talk with large telcos and big banks, the first thing that they say, what are you doing to my data? So you need to sit down and meet their infosec teams and explain to them that we’re not transferring any of that data. And it’s literally just telemetrics. And those telemetrics usually are not sensitive in terms of privacy and so on. And so usually there is a meeting that happens with their infosec teams, especially for big banks and telcos, where you clarify what is being sent and then they look at the source code because the agent is open source. So you can look at the open source and just realize that nothing sensitive is being sent to the internet.
Chris Engelbert: Right.
Luigi Nardi: And perhaps to add one more element there. So for the most conservative of our clients, we also provide a way to deploy this technology in a completely offline manner. So when everybody’s of course excited about digital transformations and moving to the cloud and so on, we actually went kind of backwards and provided a way of deploying this, which is sending a standalone software that runs in your environment and doesn’t communicate at all to the internet. So we have that as an option as well for our users. And that supports a little harder for us to deploy because we don’t have direct access to that anymore. So it’s easy for us to deploy the cloud-based version. But if you, you know, in some cases, you know, there is not very much you can do that will not allow you to go through the internet. There are companies that don’t buy Salesforce for that reason. So if you don’t buy Salesforce, you probably not buy from anybody else on the planet. So for those scenarios, that’s what we do.
Chris Engelbert: Right. So how does it work afterwards? So the machine learning looks into the data, tries to find patterns, has some optimization or some … Is it only queries or does it also give me like recommendations on how to optimize the Postgres configuration itself? And how does that present those? I guess they’re going to be shown in the UI.
Luigi Nardi: So we’re specifically focusing on that aspect, the optimization of the configuration of Postgres. So that’s our focus. And so the things like, if you’re familiar with Postgres, things like the shared buffers, which is this buffer, which contains the copy of the data from tables from the disk and keep it a local copy on RAM. And that data is useful to keep it warm in RAM, because when you interact with the CPU, then you don’t need to go all the way back to disk. And so if you go all the way back to disk, there is an order of magnitude more like delay and latency and slow down based on that. So you try to keep the data close to where it’s processed. So trying to keep the data in cache as much as possible and share buffer is a form of cache where the cache used in this case is a piece of RAM. And so sizing these shared buffers, for example, is important for performance. And then there are a number of other things similar to that, but slightly different. For example, in Postgres, there is an allocation of a buffer for each query. So each query has a buffer which can be used as an operating memory for the query to be processed. So if you’re doing some sort of like sorting, for example, in the query that small memory is used again. And you want to keep that memory close to the CPU and specifically the workman parameter, for example, is what helps with that specific thing. And so we optimize all this, all these things in a way that the flow of data from disk to the registers of the CPU, it’s very, very smooth and it’s optimized. So we optimize the locality of the data, both spatial and temporal locality if you want to use the technical terms for that.
Chris Engelbert: Right. Okay. So it doesn’t help me specifically with my stupid queries. I still have to find a consultant to fix that or find somebody else in the team.
Luigi Nardi: Yeah, for now, that’s correct. We will probably focus on that in the future. But for now, the way you usually optimize your queries is that you optimize your queries and then if you want to see what’s the actual benefit, you should also optimize your parameters. And so if you want to do it really well, you should optimize your queries, then you go optimize your parameters and go back optimize again your queries, parameters and kind of converge into this process. So now that one of the two is fully automated, you can focus on the queries and, you know, speed up the process of optimizing the queries by a large margin. So to in terms of like benefits, of course, if you optimize your queries, you will write your queries, you can get, you know, two or three order of magnitude performance improvement, which is really, really great. If you optimize the configuration of your system, you can get, you know, an order of magnitude in terms of performance improvement. And that’s, that’s still very, very significant. Despite what many people say, it’s possible to get an order of magnitude improvement in performance. If your system by baseline, it’s fairly, it’s fairly basic, let’s say. And the interesting fact is that by the nature of Postgres, for example, the default configuration of Postgres needs to be pretty conservative because Postgres needs to be able to run on big server machines, but also on smaller machines. So the form factor needs to be taken into account when you define the default configuration of Postgres. And so by that fact, it needs to be pretty conservative. And so what you can observe out there is that this problem is so complex that people don’t really change the default configuration of Postgres when they run on a much bigger instance. And so there is a lot of performance improvement that can be obtained by changing that configuration to a better-suited configuration. And you have the point of doing this through automation and through things like DBtune is that you can then refine the configuration of your system specifically for the specific use case that you have, like your application, your workload, the machine size, and all these things are considered together to give you the best outcome for your use case, which is, I think, the new part, the novelty of this approach, right? Because if you’re doing this through some sort of heuristics, they usually don’t really get to cover all these different things. And there will always be kind of super respect to what you can do with an observability loop, right?
Chris Engelbert: Yeah, and I think you mentioned that a lot of people don’t touch the configuration. I think there is the problem that the Postgres configuration is very complex. A lot of parameters depend on each other. And it’s, I mean, I’m coming from a Java background, and we have the same thing with garbage collectors. Optimizing a garbage collector, for every single algorithm you have like 20 or 30 parameters, all of them depend on each other. Changing one may completely disrupt all the other ones. And I think that is what a lot of people kind of fear away from. And then you Google, and then there’s like the big Postgres community telling you, “No, you really don’t want to change that parameter until you really know what you’re doing,” and you don’t know, so you leave it alone. So in this case, I think something like Dbtune will be or is absolutely amazing.
Luigi Nardi: Exactly. And, you know, if you spend some time on blog posts learning about the Postgres parameters you get that type of feedback and takes a lot of time to learn it in a way that you can feel confident and comfortable in changes in your production system, especially if you’re working in a big corporation. And the idea here is that at DBtune we are partnered with leading Postgres experts as well. Magnus Hagander, for example, we see present of the Postgres Europe organization, for example, it’s been doing this manual tuning for about two decades and we worked very closely with him to be able to really do this in a very safe manner, right. You should basically trust our system to be doing the right thing because it’s engineering a way that incorporates a lot of domain expertise so it’s not just machine learning it’s also about the specific Postgres domain expertise that you need to do this well and safely.
Chris Engelbert: Oh, cool. All right. We’re almost out of time. Last question. What do you think it’s like the next big thing in Postgres and databases, in cloud, in db tuning.
Luigi Nardi: That’s a huge question. So we’ve seen all sorts of things happening recently with, of course, AI stuff but, you know, I think it’s, it’s too simple to talk about that once more I think you guys covered those type of topics a lot. I think what’s interesting is that there is there is a lot that has been done to support those type of models and using for example the rise of vector databases for example, which was I think quite interesting vector databases like for example the extension for Postgres, the pgvector was around for a little while but in last year you really saw a huge adoption and that’s driven by all sort of large language models that use this vector embeddings and that’s I think a trend that will see for a little while. For example, our lead investor 42CAP, they recently invested in another company that does this type of things as well, Qdrant for example, and there are a number of companies that focus on that Milvus and Chroma, Zilliz, you know, there are a number of companies, pg_vectorize as well by the Tembo friends. So this is certainly a trend that will stay and for a fairly long time. In terms of database systems, I am personally very excited about the huge shift left that is happening in the industry. Shift left the meaning all the databases of service, you know, from Azure flexible server Amazon RDS, Google Cloud SQL, those are the big ones, but there are a number of other companies that are doing the same and they’re very interesting ideas, things that are really, you know, shaping that whole area, so I can mention a few for example, Tembo, even EnterpriseDB and so on that there’s so much going on in that space and in some sort, the DBtune is really in that specific direction, right? So helping to automate more and more of what you need to do in a database when you’re operating at database. From a machine learning perspective, and then I will stop that Chris, I think we’re running out of time. From machine learning perspective, I’m really interested in, and that’s something that we’ve been studying for a few years now in my academic team, with my PhD students. The, you know, pushing the boundaries of what we can do in terms of using machine learning for computer systems and specifically when you get computer systems that have hundreds, if not thousands of parameters and variables to be optimized at the same time jointly. And we have recently published a few pieces of work that you can find on my Google Scholar on that specific topic. So it’s a little math-y, you know, it’s a little hard to maybe read them parts, but it’s quite rewarding to see that these new pieces of technology are becoming available to practitioners and people that work on applications as well. So that perhaps the attention will move away at some point from full LLMs to also other areas in machine learning and AI that are also equally interesting in my opinion.
Chris Engelbert: Perfect. That’s, that’s beautiful. Just send me the link. I’m happy to put it into the show note. I bet there’s quite a few people that would be really, really into reading those things. I’m not big on mathematics that’s probably way over my head, but that’s, that’s fine. Yeah, I was that was a pleasure. Thank you for being here. And I hope we. Yeah, I hope we see each other somewhere at a Postgres conference we just briefly talked about that before the recording started. So yeah, thank you for being here. And for the audience, I see you, I hear you next week or you hear me next week with the next episode. And thank you for being here as well.
Luigi Nardi: Awesome for the audience will be at the Postgres Switzerland conference as sponsors and we will be giving talks there so if you come by, feel free to say hi, and we can grab coffee together. Thank you very much.
Chris Engelbert: Perfect. Yes. Thank you. Bye bye.
You may also like:
Why would you run PostgreSQL in Kubernetes, and how?
The True Cost of Database Management: RDS vs. EKS
How to reduce AWS cloud costs with AWS marketplace products?