When people talk about AI breakthroughs, they mention ChatGPT, DALL-E, and AlphaFold. What they rarely mention is spreadsheets. But tabular data - the rows and columns in databases, spreadsheets, and time-series systems - is where most real-world decisions are made. And it's one of the hardest problems in AI.
What Is Tabular Data?
Tabular data is structured information in rows and columns. It's everywhere:
- Finance: Stock prices, transaction records, credit scores
- Healthcare: Patient records, lab results, clinical trials
- Business: Sales data, customer databases, inventory systems
- Science: Experimental measurements, sensor readings, survey data
# Example: Financial time-series (tabular data)
timestamp,open,high,low,close,volume,rsi,macd
2026-01-28 09:30,185.50,186.20,185.30,185.90,1250000,58.3,0.42
2026-01-28 09:31,185.90,186.45,185.80,186.30,980000,59.1,0.51
2026-01-28 09:32,186.30,186.50,186.10,186.25,1100000,58.7,0.48
# Each row is an observation
# Each column is a feature
# The structure matters
If you've ever worked with Excel, SQL databases, or pandas DataFrames, you've worked with tabular data. It's the backbone of enterprise software and analytical decision-making.
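As a minimal sketch, here is how the snippet above looks once loaded into pandas (assuming it's saved as prices.csv, a hypothetical filename):

# Minimal sketch: load the CSV above into a pandas DataFrame.
# "prices.csv" is a hypothetical filename for the snippet shown earlier.
import pandas as pd

df = pd.read_csv("prices.csv", parse_dates=["timestamp"])
print(df.dtypes)   # each column has its own type: datetime, float, int
print(df.head())   # each row is one observation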
Why Doesn't Anyone Talk About It?
The AI hype cycle favors flashy demonstrations. Generating images from text prompts is visually impressive. Chatbots feel magical. Predicting the next value in a spreadsheet? Less exciting to demo, but often more valuable.
There's also a perception issue. Deep learning revolutionized image and language tasks so dramatically that people assume the same approaches work for all data types. They don't.
The techniques that made LLMs and image generators work (transformers, massive scale, self-supervision) don't automatically transfer to tabular data. The problems are fundamentally different.
Why Is It So Hard?
1. No Natural Pretrained Structure
Images have natural structure: pixels next to each other are related. Text has natural structure: words near each other form sentences. These structures enable self-supervised pretraining on massive datasets.
Tabular data has no universal structure. A column could be categorical (red, blue, green) or continuous (185.50, 186.30). The relationship between column A and column B varies completely between datasets. There's no "ImageNet for spreadsheets."
# Image: Local structure is universal
pixel[i][j] is related to pixel[i+1][j] # Always true
# Text: Sequential structure is universal
word[i] is related to word[i+1] # Always true
# Tabular: Structure varies per dataset
column[A] vs column[B]? # Depends on the specific data
2. Feature Heterogeneity
In an image, every pixel is fundamentally the same type: a color value. In a table, column 1 might be a date, column 2 a price, column 3 a category, column 4 a free-text field. Each requires different handling.
Deep learning loves homogeneous data. Tabular data is the opposite.
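To make that concrete, here's a minimal sketch in pandas (the column names are hypothetical): every column type needs its own treatment before any model can consume the table.

# Sketch: one table, four column types, four different treatments.
# Column names are hypothetical examples, not from any real dataset.
import pandas as pd

df = pd.DataFrame({
    "signup_date": pd.to_datetime(["2026-01-02", "2026-01-15"]),
    "price": [185.50, 186.30],                # continuous -> scale/normalize
    "segment": ["retail", "institutional"],   # categorical -> encode
    "notes": ["called twice", "no contact"],  # free text -> embed or drop
})

df["signup_month"] = df["signup_date"].dt.month                          # dates -> derived features
df["price_z"] = (df["price"] - df["price"].mean()) / df["price"].std()  # scaling
df = pd.get_dummies(df, columns=["segment"])                             # one-hot encoding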
3. Signal-to-Noise Ratio
In image classification, the signal is strong. A cat looks like a cat in most pictures. In financial prediction, the signal is weak and buried in noise. Markets are partially efficient - if patterns were obvious, they'd be arbitraged away.
Financial time-series have signal-to-noise ratios close to zero. The predictable component is tiny compared to the random variation. This is fundamentally harder than classifying cats vs dogs.
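A toy illustration with synthetic numbers (not real market data): when the predictable component is a small fraction of total variance, even a model that knows the signal perfectly explains almost nothing.

# Toy illustration with synthetic data: a tiny signal buried in noise.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
signal = 0.05 * rng.standard_normal(n)   # the predictable component (small)
noise = rng.standard_normal(n)           # the unpredictable component (large)
returns = signal + noise

# Even a perfect model of the signal explains only a sliver of the variance.
r_squared = np.var(signal) / np.var(returns)
print(f"Variance explained by a perfect model: {r_squared:.2%}")  # roughly 0.25%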
4. Non-Stationarity
Cats looked the same in 2010 as they do in 2026. Markets don't. The patterns that worked last year might not work this year. Regimes change. Correlations break.
This non-stationarity means models need to continuously adapt. A one-time training run isn't enough.
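Here's a quick sketch of what that looks like, again with synthetic data: a relationship that holds in one regime and inverts in the next, which is exactly what a one-time training run can't handle.

# Sketch with synthetic data: a relationship that flips between regimes.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = rng.standard_normal(1_000)
# First half: y moves with x. Second half: the relationship inverts.
y = np.concatenate([0.8 * x[:500], -0.8 * x[500:]]) + 0.5 * rng.standard_normal(1_000)

rolling_corr = pd.Series(y).rolling(200).corr(pd.Series(x))
print(rolling_corr.iloc[300], rolling_corr.iloc[900])  # strongly positive, then strongly negative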
5. Adversarial Dynamics
When you train an image classifier, the images don't change to fool your model. Financial markets are different. Other participants adapt. If a pattern becomes known, it gets traded away.
You're not just predicting a static system - you're playing against other agents who are also trying to predict and profit.
Why LLMs Don't Solve This
A common question: "Can't LLMs handle tabular data? They're good at everything else."
No. Here's why:
Tokenization Mismatch
LLMs tokenize input into subword units. The number "185.50" becomes a sequence of tokens like ["185", ".", "50"]. The model doesn't "know" that 185.50 is between 185.49 and 185.51 on a number line. It just sees character patterns.
# What an LLM sees
"The price is $185.50" -> ["The", " price", " is", " $", "185", ".", "50"]
# What a numerical model sees
185.50 -> continuous value with relationships to other values
# The LLM representation loses numerical meaning
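If you want to see this for yourself, a small sketch using the tiktoken library (assuming it's installed) prints the fragments a GPT-style tokenizer actually produces for a price string:

# Sketch: inspect how a GPT-style tokenizer fragments a number.
# Assumes the tiktoken library is installed; exact splits vary by encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("The price is $185.50")
print([enc.decode([t]) for t in tokens])  # subword fragments, not a numeric value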
No Probabilistic Output
LLMs output text. They can generate "the price will go up" or "70% chance of increase," but these are text strings, not actual probability distributions. There's no guarantee the "70%" reflects actual model confidence.
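For contrast, here's a sketch of what a genuinely probabilistic output looks like: a model head that emits distribution parameters you can sample from and score with a proper loss. The tiny network is a generic placeholder, not any particular production architecture.

# Sketch: a model head that outputs a distribution object, not a sentence.
# The tiny network is a generic placeholder, not a production architecture.
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    def __init__(self, n_features: int):
        super().__init__()
        self.mean = nn.Linear(n_features, 1)
        self.log_std = nn.Linear(n_features, 1)

    def forward(self, x):
        # A real distribution: you can take its mean, sample from it,
        # or train it with negative log-likelihood.
        return torch.distributions.Normal(self.mean(x), self.log_std(x).exp())

dist = GaussianHead(n_features=8)(torch.randn(4, 8))
print(dist.mean.squeeze(), dist.stddev.squeeze())  # actual numbers, not a text string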
Wrong Inductive Biases
LLMs are optimized for linguistic patterns - grammar, semantics, reasoning about text. Numerical relationships in tabular data require different inductive biases. Using an LLM for tabular prediction is like using a hammer for screws. It might sort of work, but you're using the wrong tool.
What Actually Works
Tabular prediction requires specialized approaches:
Gradient Boosted Trees
Methods like XGBoost, LightGBM, and CatBoost remain competitive on tabular data. They handle feature heterogeneity well and often outperform deep learning on small-to-medium datasets.
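A minimal sketch with LightGBM on synthetic data (hyperparameters are illustrative, not tuned):

# Minimal sketch: gradient boosted trees on synthetic tabular data.
# Hyperparameters are illustrative, not tuned values.
import numpy as np
from lightgbm import LGBMRegressor

rng = np.random.default_rng(42)
X = rng.standard_normal((5_000, 10))
y = 2.0 * X[:, 0] + X[:, 1] ** 2 + rng.standard_normal(5_000)  # nonlinear target

model = LGBMRegressor(n_estimators=500, learning_rate=0.05)
model.fit(X[:4_000], y[:4_000])
print(model.score(X[4_000:], y[4_000:]))  # R^2 on a held-out slice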
Purpose-Built Tabular Neural Networks
Research is active on neural architectures specifically designed for tabular data - TabNet, TabTransformer, FT-Transformer. These incorporate structural priors that make sense for rows-and-columns data.
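To give a flavor of the structural priors involved, here's a simplified sketch of a feature tokenizer, the idea behind FT-Transformer-style models (an illustration, not the published implementation): each column gets its own embedding, so a heterogeneous row becomes a homogeneous sequence of tokens a transformer can attend over.

# Simplified sketch of a feature tokenizer: per-column embeddings turn a
# heterogeneous row into a sequence of same-sized tokens for a transformer.
# This illustrates the idea; it is not the published FT-Transformer code.
import torch
import torch.nn as nn

class FeatureTokenizer(nn.Module):
    def __init__(self, n_numeric: int, cat_cardinalities: list[int], d_model: int = 32):
        super().__init__()
        # One learned scale and bias per numeric column.
        self.num_weight = nn.Parameter(torch.randn(n_numeric, d_model))
        self.num_bias = nn.Parameter(torch.zeros(n_numeric, d_model))
        # One embedding table per categorical column.
        self.cat_embeds = nn.ModuleList([nn.Embedding(c, d_model) for c in cat_cardinalities])

    def forward(self, x_num, x_cat):
        num_tokens = x_num.unsqueeze(-1) * self.num_weight + self.num_bias   # (B, n_num, d)
        cat_tokens = torch.stack(
            [emb(x_cat[:, i]) for i, emb in enumerate(self.cat_embeds)], dim=1
        )                                                                    # (B, n_cat, d)
        return torch.cat([num_tokens, cat_tokens], dim=1)                    # (B, n_cols, d)

tok = FeatureTokenizer(n_numeric=3, cat_cardinalities=[5, 12])
tokens = tok(torch.randn(8, 3), torch.randint(0, 5, (8, 2)))
print(tokens.shape)  # torch.Size([8, 5, 32]) -> ready for a transformer encoder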
Ensembles and Uncertainty
Given the noise and non-stationarity, ensemble methods and explicit uncertainty quantification become essential. Single models are brittle. Ensembles that quantify disagreement provide more robust predictions.
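A minimal sketch of the general mechanism, using generic boosted trees on synthetic data: train several models on resampled data and treat the spread of their predictions as an uncertainty signal.

# Sketch: ensemble disagreement as an uncertainty signal.
# Generic bagged models on synthetic data, not any specific production system.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(7)
X = rng.standard_normal((2_000, 5))
y = X[:, 0] - 0.5 * X[:, 1] + rng.standard_normal(2_000)

models = []
for seed in range(5):
    idx = rng.integers(0, len(X), len(X))  # bootstrap resample
    models.append(GradientBoostingRegressor(random_state=seed).fit(X[idx], y[idx]))

preds = np.stack([m.predict(X[:10]) for m in models])  # (n_models, n_samples)
print("prediction:", preds.mean(axis=0))
print("uncertainty (spread):", preds.std(axis=0))       # high spread -> low confidence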
At Kunkafa, we use ensemble approaches with multiple specialized tabular architectures. Each model contributes to both the prediction and the uncertainty estimate.
Why This Matters Beyond Finance
Financial prediction is our application domain, but the problem is broader. Tabular data prediction affects:
- Healthcare: Predicting patient outcomes from medical records
- Manufacturing: Predictive maintenance from sensor data
- Logistics: Demand forecasting from sales history
- Insurance: Risk assessment from applicant data
These are high-stakes domains where better predictions directly impact lives and businesses. Yet they don't get the attention that chatbots and image generators receive.
The State of Research
Tabular deep learning is an active research area, but progress is slower than in language or vision. Some reasons:
- Less benchmark standardization (each dataset is different)
- Harder to scale (no ImageNet equivalent)
- Less venture capital attention (not as demo-able)
- Traditional methods (boosting) remain strong baselines
This creates an interesting dynamic. The field is underexplored relative to its importance. Companies doing serious work here have an opportunity - but it requires domain expertise, not just scaling up existing architectures.
Kunkafa's Approach
We started Kunkafa specifically because tabular prediction is hard and underserved. Our approach:
- Domain-specific architectures: Built for time-series OHLCV (open, high, low, close, volume) data, not adapted from language models
- Native uncertainty: Probabilistic output is fundamental, not bolted on
- Continuous adaptation: Models retrain to handle non-stationarity
- Ensemble methods: Multiple architectures combined for robustness
Financial markets are our proving ground because they're the hardest test case: low signal-to-noise, non-stationary, adversarial. If we can build reliable predictions here, the methods transfer to other tabular domains.
The Bottom Line
Attention in the AI revolution has been distributed unevenly. Language and vision have received massive investment and made dramatic progress. Tabular data - where most real-world decisions happen - remains a harder, less glamorous problem.
The next time you see an AI product for spreadsheet-style prediction, ask what architecture it uses. If it's just a language model with a different interface, be skeptical. The problem requires specialized approaches.
Tabular data prediction isn't solved. But for those of us working on it, that's exactly what makes it interesting.