Compression Insights, Innovations and Best Practices

Machine learning is hungry. Can you fill it up with something smaller?

Written by Nicholas Stavrinou | Sep 22, 2025 9:00:00 AM

There's a good chance you've heard that AI models are hungry. Training them is a data-intensive project. Absolute masses of raw training data mean we also need incredible processing power and storage space to keep that data around. But the more data, the more accurate the model, right? So the expense to our bank accounts and to the planet is worth it, right?

Some argue that no, that's actually not true: models are more accurate when we train them on less, but higher-quality, data. If there's a lot of noise in your training data, your model may end up shaping itself in the image of the junkier bits of your massive dataset.

And maybe it's not too hard to imagine why. If you're scraping all sorts of data without a filter for quality, your model will treat everything as fair game for its learning process. Whittling your training data down has its upsides. Not only is less data easier and cheaper to store, it's also less energy-consuming to fetch and transmit from its place in storage. The savings to storage expenses, network traffic costs, and the planet are clear.

There are a few different ways to make your data smaller. Dimensionality reduction projects your dataset onto a smaller set of composite features, while feature selection keeps only the most informative of the original columns (the ones that can otherwise spiral out of control); both cut down the number of properties your dataset has while keeping the meaningful signal. Data that's been stripped of redundancies and noise makes the process of training a model quicker and smoother.
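To make that a little more concrete, here's a minimal sketch of both ideas in Python using scikit-learn; the library choice and the toy dataset are assumptions for illustration, not a prescription.

```python
# A minimal sketch using scikit-learn (an assumption of this example).
# PCA is one common dimensionality-reduction technique; VarianceThreshold
# is a simple feature-selection filter.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 50))          # 1,000 samples, 50 columns
X[:, 10] = 0.0                            # a column that carries no signal

# Feature selection: drop near-constant columns outright.
selector = VarianceThreshold(threshold=1e-3)
X_selected = selector.fit_transform(X)    # the dead column disappears

# Dimensionality reduction: project what's left onto fewer composite
# features that retain most of the variance in the data.
pca = PCA(n_components=0.95)              # keep 95% of the variance
X_reduced = pca.fit_transform(X_selected)

print(X.shape, "->", X_reduced.shape)
```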

And in any case, if you're storing masses of training data, like spreadsheets or natural language or images, making these things smaller is worthwhile, cutting down on processing time and storage costs. You'll need to make sure you're saving these assets losslessly. When it comes to natural language, numbers, and high-precision images, any mistake in the decompressed data can have destructive downstream effects. Accidental word replacements can turn a quality sample into gibberish, while numerical flubs can destroy the reliability of whatever the model returns. Images with unexpected artifacts lurking among the pixels, likewise, can lead a model to false conclusions. Imagine the danger if the model is training on noisy or pixelated medical images, for example, hoping to become a reliable online diagnostic assistant for skin conditions.
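The lossless guarantee itself is easy to illustrate. Here's a small sketch using Python's built-in zlib, chosen only because it ships with the language; any general-purpose lossless codec gives the same byte-for-byte round trip, and the stand-in CSV data is made up for the example.

```python
# Illustration only: zlib is used because it ships with Python; any
# general-purpose lossless codec offers the same guarantee.
import zlib

# A stand-in for a slice of training data; imagine rows of measurements.
rows = ["patient_id,lesion_diameter_mm"]
rows += [f"{i:04d},{(i % 40) / 10:.1f}" for i in range(1, 1001)]
sample = "\n".join(rows).encode("utf-8")

compressed = zlib.compress(sample, level=9)
restored = zlib.decompress(compressed)

# The defining property of lossless compression: the round trip is exact,
# byte for byte. No swapped words, no flipped digits, no stray pixels.
assert restored == sample
print(f"{len(sample):,} bytes -> {len(compressed):,} bytes, restored exactly")
```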

Clearly there's work to be done preparing our data for AI models, and there's a lot of thought being put into data quality right now. There are clear parallels with data compression, a task that's on many of our plates. We care about our data. We want it to be as high-quality as possible. But we also have an interest in making it as small as possible, to lower the costs of transmitting, storing, and retrieving it. So what tips can we take away from AI data management? Here are two quick ones:

  1. Clean your data before you store it. Keeping a dump of messy data just kicks the can down the road. To be usable for any application, including AI training, data will ultimately need to meet your quality standard. AI is quickly learning the lesson that messy data in equals unreliable results out. Another incentive? Your files will compress even smaller, saving storage space and money with it. (There's a minimal sketch of this step just after the list.)
  2. Think about retrieval needs in advance. Many cloud storage providers offer different tiers of storage, often with playful, descriptive names like "the glacier." Hot storage is for data you'll be retrieving often; cold storage is for the stuff you keep for a rainy day, and it's cheaper to hold but slower and pricier to get back. Cold storage saves energy and money, but some data might need to stay hot. You decide which is which. (The second sketch below shows the choice in practice.)
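For the first tip, here's a minimal sketch assuming a tabular dataset and pandas (neither is prescribed here); the point is simply that duplicates and unusable rows get dropped before the file is compressed and archived.

```python
# A sketch of tip 1, assuming a tabular dataset and pandas. Dropping
# duplicates and unusable rows before compression means less junk is
# preserved forever, and the compressed file ends up smaller too.
import pandas as pd

df = pd.DataFrame({
    "text":  ["good sample", "good sample", "???", None, "another sample"],
    "label": [1, 1, 0, 1, 0],
})

cleaned = (
    df.drop_duplicates()          # identical rows add bytes, not information
      .dropna(subset=["text"])    # rows with no text can't be trained on
)

# pandas can write a compressed file directly; gzip keeps it lossless.
cleaned.to_csv("training_data.csv.gz", index=False, compression="gzip")
```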
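For the second tip, here's a hypothetical sketch using Amazon S3's storage classes via boto3; the bucket and file names are placeholders, and other cloud providers offer equivalent hot and cold tiers.

```python
# Hypothetical sketch of tip 2: S3 is just one example of hot vs. cold
# tiers, and "example-ml-bucket" is a placeholder name.
import boto3

s3 = boto3.client("s3")

# Hot: the cleaned, compressed training set your pipelines pull often.
s3.upload_file(
    "training_data.csv.gz", "example-ml-bucket", "hot/training_data.csv.gz",
    ExtraArgs={"StorageClass": "STANDARD"},
)

# Cold: the raw dump you keep for a rainy day. Glacier-style classes are
# cheaper to store but slower and costlier to retrieve.
s3.upload_file(
    "raw_dump.tar.gz", "example-ml-bucket", "cold/raw_dump.tar.gz",
    ExtraArgs={"StorageClass": "GLACIER"},
)
```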

Ready to make storing quality data part of your workflow? CompressionX just might be the compression tool you need. CompressionX is revolutionising the way data is stored and shared with a state-of-the-art compression model that shrinks data as small as it can go without sacrificing any – and we mean any – quality. We all, just as much as AI, need quality data that's 100% reliable. Corrupted files just aren't our thing.


References

  • https://www.ibm.com/topics/dimensionality-reduction
  • https://ujangriswanto08.medium.com/harnessing-the-power-of-svd-for-efficient-data-compression-in-machine-learning-bbeccb379a3d
  • https://medium.com/data-science-at-microsoft/model-compression-and-optimization-why-think-bigger-when-you-can-think-smaller-216ec096f68b
  • https://www.heavy.ai/technical-glossary/feature-selection