Building a Stock Trend Predictor for a Market That Has No API

Nepal Stock Exchange has been running since 1994. It lists 359 companies. Thousands of traders use it daily. It has no public API.

That is the first wall you hit when you try to build anything data-driven for NEPSE. No official endpoint, no rate-limited free tier, no developer documentation. Just a website called MeroLagani that aggregates market data for human eyes.

This is the story of building around that constraint. The scraper, the data pipeline, the LSTM training runs, the Django REST backend, the Next.js frontend, and where the validation loss told me the truth I did not want to hear.

The project ran from March 2025 through May 2025, roughly three months from first commit to a working full-stack system.

The scraper

With no API available, I wrote a scraper that hits MeroLagani's live market table every 30 seconds during trading hours.

NEPSE trades Sunday through Thursday, 11:00 AM to 3:00 PM Nepali time. The scraper respects that schedule. It calculates the wait time to the next market open and sleeps, rather than polling and checking continuously.

from datetime import time as dtime

market_open = dtime(11, 0)
market_close = dtime(15, 0)
allowed_weekdays = {6, 0, 1, 2, 3}  # Sunday to Thursday; datetime.weekday() counts Monday as 0
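
The sleep-until-open logic can be sketched roughly like this. It is a minimal illustration, not the scraper's actual code: the function names are hypothetical, the fixed UTC+05:45 offset stands in for Nepali time, and public holidays are ignored.

```python
from datetime import datetime, time as dtime, timedelta, timezone

# Nepal Standard Time is UTC+05:45 with no daylight saving.
NPT = timezone(timedelta(hours=5, minutes=45))

MARKET_OPEN = dtime(11, 0)
MARKET_CLOSE = dtime(15, 0)
ALLOWED_WEEKDAYS = {6, 0, 1, 2, 3}  # Sunday to Thursday (Monday is 0)


def is_market_open(now: datetime) -> bool:
    """True when `now` (timezone-aware) falls inside a trading session."""
    local = now.astimezone(NPT)
    return (local.weekday() in ALLOWED_WEEKDAYS
            and MARKET_OPEN <= local.time() < MARKET_CLOSE)


def seconds_until_open(now: datetime) -> float:
    """Seconds to sleep until the next session opens; 0 if already open."""
    if is_market_open(now):
        return 0.0
    local = now.astimezone(NPT)
    # Walk forward day by day until a trading day's open is still ahead of us.
    for offset in range(8):
        day = local.date() + timedelta(days=offset)
        candidate = datetime.combine(day, MARKET_OPEN, tzinfo=NPT)
        if day.weekday() in ALLOWED_WEEKDAYS and candidate > local:
            return (candidate - local).total_seconds()
    raise RuntimeError("no trading day found within a week")
```

A loop built on this would call time.sleep(seconds_until_open(datetime.now(NPT))) once instead of waking up repeatedly to check the clock.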

Every 30 minutes it pauses for one minute to avoid triggering any server-side rate limiting. Each scrape fetches Symbol, LTP (last traded price), percent change, open, high, low, quantity, previous close, and diff, covering nine fields per stock across all listed symbols.

The data goes into a daily JSON file. One file per trading day.

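from datetime import datetime

def get_filename():
    # Uses the machine's local date, so the host clock should be on Nepali time.
    date_str = datetime.now().strftime("%Y-%m-%d")
    return f"market_data_{date_str}.json"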

The JSON problem

The daily files came out broken.

The scraper writes incrementally throughout the day. If anything interrupts the process, whether a network hiccup, a keyboard interrupt, or a power cut, the JSON file ends mid-write and fails to parse. This happened repeatedly.

I ended up writing a repair script that runs json_repair on every raw file before importing:

from json_repair import repair_json

with open('./new-data/market_data_2025-04-29.json', 'r') as f:
    broken_json = f.read()

fixed_json = repair_json(broken_json)

with open('./new-fixed/repair_market_data_2025-04-29-data.json', 'w') as f:
    f.write(fixed_json)

This became a mandatory step in the pipeline. Scrape, repair, import. The repair step should have been unnecessary. The right fix is atomic writes or appending newline-delimited JSON instead of building one large JSON array throughout the day. I noted that and kept moving.
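
For comparison, the newline-delimited alternative might look something like the sketch below. It is not what the project does today; the function names and the one-record-per-line layout are assumptions made for illustration.

```python
import json
from pathlib import Path


def append_snapshot(path: Path, snapshot: dict) -> None:
    """Append one scrape as a single JSON line."""
    line = (json.dumps(snapshot) + "\n").encode("utf-8")
    with path.open("ab+") as f:
        # If a previous run died mid-line, start this record on a fresh line.
        f.seek(0, 2)
        if f.tell() > 0:
            f.seek(-1, 2)
            if f.read(1) != b"\n":
                f.write(b"\n")
        f.write(line)


def read_snapshots(path: Path) -> list[dict]:
    """Load every complete line, skipping any line that fails to parse."""
    snapshots = []
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                snapshots.append(json.loads(line))
            except json.JSONDecodeError:
                # A truncated line left behind by an interrupted write.
                continue
    return snapshots
```

With this layout an interruption costs at most the record being written, and the rest of the day's file still parses without a repair pass.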

Getting it into PostgreSQL

The import script uses ijson for streaming. The daily files were large enough that loading them fully into memory caused problems on the local machine.

The data also needed cleaning before insert. NEPSE formats numbers with commas (1,234.56). Percentages come with a % suffix. Quantity fields are integers stored as comma-formatted strings.

numeric_cols = ['LTP', 'Open', 'High', 'Low', 'PClose', 'Diff.']
for col in numeric_cols:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col].str.replace(',', ''), errors='coerce')

if '% Change' in df.columns:
    df['percent_change'] = pd.to_numeric(
        df['% Change'].str.replace('%', ''), errors='coerce'
    )

Inserts go in batches of 1000 rows through pandas' to_sql with method='multi' on a SQLAlchemy engine, which packs many rows into each INSERT statement. The final dataset used for training had 1,109,081 rows across 359 symbols, covering around 20 trading days between February and April 2025.
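
The same cleaning rules, plus the batching, can be written without pandas for a streaming import. This is a dependency-free sketch under assumptions: the helper names are hypothetical and the real import script may handle edge cases differently.

```python
from itertools import islice
from typing import Iterable, Iterator, Optional


def parse_number(text: str) -> Optional[float]:
    """Strip thousands separators and a % suffix; None when the cell is not numeric."""
    cleaned = text.replace(",", "").replace("%", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


def parse_quantity(text: str) -> Optional[int]:
    """Quantities arrive as comma-formatted integer strings."""
    cleaned = text.replace(",", "").strip()
    return int(cleaned) if cleaned.isdigit() else None


def batched(rows: Iterable[dict], size: int = 1000) -> Iterator[list[dict]]:
    """Yield lists of at most `size` rows without loading everything at once."""
    it = iter(rows)
    while batch := list(islice(it, size)):
        yield batch
```

Returning None for unparseable cells plays the same role as errors='coerce' in the pandas version: bad values become nulls instead of aborting the import.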

Feature engineering

Before training, I added technical indicators on top of the raw OHLC data:

10-day and 50-day SMA
SMA crossover signal (1 when short SMA is above long SMA, 0 otherwise)
Position differencing to detect actual crossover events
EMA (10-period)
RSI (14-period)
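
For reference, here is roughly what those indicators compute, written as plain Python so the arithmetic is visible. The function names are hypothetical, and the RSI below uses simple averages of gains and losses; the project may use a different smoothing variant.

```python
def sma(values, window):
    """Simple moving average; None until a full window is available."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window:i + 1]) / window)
    return out


def ema(values, span):
    """Exponential moving average with alpha = 2 / (span + 1), seeded with the first value."""
    alpha = 2 / (span + 1)
    out = []
    for v in values:
        out.append(v if not out else alpha * v + (1 - alpha) * out[-1])
    return out


def rsi(values, period=14):
    """RSI from simple averages of gains and losses over the last `period` changes."""
    out = [None] * len(values)
    changes = [b - a for a, b in zip(values, values[1:])]
    for i in range(period, len(values)):
        window = changes[i - period:i]
        avg_gain = sum(c for c in window if c > 0) / period
        avg_loss = sum(-c for c in window if c < 0) / period
        out[i] = 100.0 if avg_loss == 0 else 100 - 100 / (1 + avg_gain / avg_loss)
    return out


def crossover(short, long):
    """Signal is 1 while short > long; position is its difference (+1 cross up, -1 cross down)."""
    signal = [1 if s is not None and g is not None and s > g else 0
              for s, g in zip(short, long)]
    position = [0] + [b - a for a, b in zip(signal, signal[1:])]
    return signal, position
```

The position series is what separates "short SMA is currently above long SMA" from "short SMA just crossed above it", which is the event a trader actually cares about.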

The training pipeline then used RandomForestRegressor to select the top 50 features from all engineered columns, including one-hot encoded symbol columns. The idea was to let the forest figure out which features actually predicted price movement rather than guessing manually.

This decision created a problem I will get to.

Training

Three stacked LSTM layers, dropout 0.2 after each, Adam at lr=0.0005, MSE loss. Input windows of 60 timesteps, 50 features per timestep. Early stopping with patience=5 on validation loss. Training ran on Google Colab for GPU access since the local machine could not handle it.
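
The 60-timestep input windows can be produced along these lines. This is a sketch under assumptions: make_windows is a hypothetical helper, and target stands for whichever column the model predicts, which this write-up does not pin down.

```python
import numpy as np


def make_windows(features: np.ndarray, target: np.ndarray, window: int = 60):
    """Slice a (T, F) feature matrix into (N, window, F) model inputs.

    Each run of `window` consecutive rows is paired with the target value
    of the row that immediately follows it.
    """
    if len(features) <= window:
        return np.empty((0, window, features.shape[1])), np.empty((0,))
    X = np.stack([features[i:i + window] for i in range(len(features) - window)])
    y = target[window:]
    return X, y
```

Windows should be built one symbol at a time, on rows sorted by timestamp, so that no window straddles two different stocks.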

I ran multiple training iterations between March 9 and April 29, 2025 as I cleaned and expanded the dataset. The loss plots from those runs tell the same story every time.

Train loss drops consistently across epochs. The model is learning. Validation loss oscillates. It spikes, drops, spikes again, and never converges with the training curve. In the worst runs it diverges completely after epoch 13 or 14, spiking to 3-4x the training loss before early stopping kicks in.
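
To make the patience rule concrete, here is a small sketch of when training would halt for a given validation curve. It mirrors the usual patience logic rather than any particular library's implementation, and the function name is hypothetical.

```python
def early_stop_epoch(val_losses, patience=5):
    """Return the 1-based epoch at which training stops, or None if it never triggers."""
    best = float("inf")
    since_best = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # New best validation loss: reset the counter.
            best, since_best = loss, 0
        else:
            since_best += 1
            if since_best >= patience:
                return epoch
    return None
```

An oscillating curve can keep a run alive longer than it deserves: every new low resets the counter, so training only stops after five consecutive epochs without improvement.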

This is overfitting. The model memorizes the training data and fails to generalize.

The cause is straightforward: 20 trading days of intraday data is not enough to train an LSTM on time-series patterns across 359 symbols. At roughly 3,100 rows per symbol (1,109,081 / 359), consecutive 60-step windows overlap almost entirely, so the model sees only a handful of genuinely distinct stretches of market behavior for each symbol. It has no real basis for generalization.

The feature mismatch

There is a second problem that compounds the first.

During training, RandomForest selects the top 50 features including one-hot encoded symbol columns. Different symbols get selected because different stocks have different volatility patterns and sector behaviors. The model learns symbol-specific patterns as part of its 50-feature input.

At inference time, I only have the 17 real market features for the stock being predicted. The symbol one-hot columns from the training dataset do not exist. So I pad the remaining 33 slots with zeros.

import numpy as np

dummy_cols = np.zeros((60, 33))  # placeholders for the 33 columns missing at inference
full_features = np.concatenate([current_features.values, dummy_cols], axis=1)  # (60, 50)

The model runs, but it is partially blind. The symbol-specific patterns it learned during training contribute nothing at prediction time because those feature positions are always zero.

The README documents this honestly. The fix is to use a fixed symbol embedding at both training and inference. A lookup table of learned vectors per symbol, reconstructed from just the symbol string, without needing the full training dataset present.
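
The lookup side of that fix could look something like this. It is a sketch of the idea, not code from the repository: the helper names and the JSON vocabulary file are assumptions, and the embedding layer that would consume the integer id is not shown.

```python
import json
from pathlib import Path

UNKNOWN = 0  # id reserved for symbols that were not present at training time


def build_symbol_vocab(symbols):
    """Assign each symbol a stable integer id, starting at 1."""
    unique = sorted({s.strip().upper() for s in symbols})
    return {sym: i for i, sym in enumerate(unique, start=1)}


def save_vocab(vocab, path: Path) -> None:
    """Persist the mapping next to the model so inference can reload it."""
    path.write_text(json.dumps(vocab, indent=2), encoding="utf-8")


def load_vocab(path: Path) -> dict:
    return json.loads(path.read_text(encoding="utf-8"))


def symbol_id(vocab, symbol: str) -> int:
    """Map a symbol string to its id; unseen symbols fall back to UNKNOWN."""
    return vocab.get(symbol.strip().upper(), UNKNOWN)
```

The vocabulary is built once at training time and shipped with the model weights. At inference the backend only needs the symbol string and this file, so the input always has the same shape and the same meaning in every position.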

The backend

The Django REST backend exposes prediction and market data through a clean set of endpoints under /nepse/. The prediction endpoint takes a stock symbol, pulls the last 60 records from PostgreSQL, engineers the 17 features in real time, pads to 50, runs the LSTM model, and returns a normalized trend value between 0 and 1. Above 0.5 signals an upward trend, below 0.5 signals downward.

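from django.http import JsonResponse
from rest_framework.decorators import api_view

from .models import Marketdata  # assumed import path for the model


@api_view(['POST'])
def predict_stock_price(request):
    stock_symbol = request.data.get("symbol")
    # Newest 60 rows first; the slice becomes a LIMIT in the SQL query.
    stock_data = list(
        Marketdata.objects.filter(symbol=stock_symbol)
        .order_by('-timestamp')[:60]
    )
    if len(stock_data) < 60:
        return JsonResponse({"error": "Not enough historical data"}, status=400)
    # Feature engineering and model inference follow from here.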

Authentication uses JWT via djangorestframework-simplejwt. Admin users can manage market data records directly through the API. Regular users get read access to stock lists and predictions. The 60-record minimum is a hard requirement. New or thinly traded symbols simply cannot be predicted without enough history in the database.
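
Two small details of the prediction path are easy to get wrong, so here is a sketch of them in isolation. Both helpers are hypothetical. It assumes the training windows ran oldest to newest, and treating a score of exactly 0.5 as neutral is my assumption, since only the above and below cases are defined.

```python
def prepare_window(rows_newest_first, window=60):
    """Return exactly `window` rows in chronological order, or raise ValueError."""
    if len(rows_newest_first) < window:
        raise ValueError("Not enough historical data")
    # The query sorts by descending timestamp, so flip it back for the model.
    return list(reversed(rows_newest_first[:window]))


def trend_direction(score, threshold=0.5):
    """Translate the normalized model output into a direction label."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score > threshold:
        return "up"
    if score < threshold:
        return "down"
    return "neutral"
```

Feeding the window in reverse order would not raise an error; the model would simply read the price history backwards, which is the kind of bug that only shows up as bad predictions.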

The frontend

The Next.js 15 frontend gives users a stock search interface where they select a symbol, enter the current NEPSE index, and hit predict. The result comes back as a normalized percentage with a Chart.js line chart showing the predicted price movement over time.

There are two separate dashboards: one for regular users showing recent market data and trend indicators, and one for admins with full CRUD access to the market data records and a user management panel. Styling is Tailwind CSS 4. The whole frontend was built between April 5 and May 13, 2025, overlapping with the last training runs in April.

What the data wall actually looks like

NYSE and NASDAQ have free API tiers through providers like Alpha Vantage and Polygon. Twenty years of daily OHLC for any listed stock, accessible with a single authenticated request.

NEPSE has the official website with a data table, a downloads section with PDFs, and nothing else.

The PDFs are worth mentioning. NEPSE publishes daily trading reports as downloadable PDFs. Parsing those is more work than scraping MeroLagani, but the data is authoritative, and an official report format is less likely to change without notice than a third-party site's HTML structure. A more robust version of this scraper would target the PDFs as the primary source and use MeroLagani as a fallback.

What I would change

Atomic file writes in the scraper. Write each scrape snapshot to a separate file, or use newline-delimited JSON and append. Stop building one large JSON array that breaks on interruption.
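
The one-file-per-snapshot option pairs naturally with a write-then-rename pattern. A minimal sketch, assuming the temp file lives in the same directory as the target; write_json_atomic is a hypothetical name.

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomic(path: Path, payload) -> None:
    """Write JSON to a temp file in the same directory, then rename it over `path`."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(payload, f)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes are on disk before the rename
        os.replace(tmp_name, path)  # the rename is atomic on POSIX filesystems
    except BaseException:
        # Clean up the partial temp file; the previous `path` is left untouched.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

A reader then sees either the previous complete file or the new complete file, never a half-written one.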

More data before any training. I started training on weeks of data. That is not enough for an LSTM to learn market patterns across 359 symbols. The right approach is to scrape for at least 6 months before touching the model.

Fixed feature contract between training and inference. Encode each stock as a fixed embedding vector, not as one-hot columns selected by a forest that runs on the full training dataset. The symbol embedding approach means inference needs only the symbol string, not the entire historical dataset.

Separate the scraper from everything else. The scraper should run on a schedule, write to the database, and know nothing about the model. Mixing concerns early made the pipeline harder to maintain.

The full codebase including the scraper, training pipeline, Django REST API, and Next.js frontend is at github.com/Mishansavy/nepse-trend-analayis. The known limitations are in the README. Pull requests open.