Developer Articles | TechForDev

Never lose a training run again: a checkpoint-and-resume playbook for ephemeral GPUs

Tanay Joshi • 1d ago • 6 min read

▶ Prefer to play with it? There's an interactive version of this article where you can break things....

#machinelearning #python #mlops #learning

AI Gateway vs API Gateway: They Solve Different Problems

Sahajmeet Kaur • 4h ago • 7 min read

We already had Kong running. Adding AI workloads on top of it made sense - until it didn't. Here's t...

#ai #devops #llm #mlops

Harvesting a regression test set from gateway logs with a plugin

Marcus Chen • 2d ago • 4 min read

TL;DR: Our eval sets went stale because a human wrote the test cases by hand once and never updated....

#mlops #llm #machinelearning #infrastructure

The SDXL VAE overflow that decoded black images in fp16

Elise Moreau • 2d ago • 4 min read

TL;DR: The SDXL VAE decoder pushes activations past 65504, the max value fp16 can hold, so the last....

#pytorch #computervision #machinelearning #mlops

Data Contracts in Production: Stop Trusting Your Upstream Sources

Gabriel Henrique • 4d ago • 5 min read

Your upstream data source changed a column type last night. Your pipeline ran at 2am, ingested...

#dataengineering #python #data #mlops

Benchmarking 5 LLM providers on one eval set, no SDK per vendor

Marcus Chen • 1d ago • 4 min read

TL;DR: We run a 1,200-case eval suite for enterprise agent automation at Nexus Labs. Comparing model...

#machinelearning #llm #mlops #devops

temperature=0 didn't make our LLM evals reproducible

Marcus Chen • 2d ago • 4 min read

TL;DR: We set temperature=0 and seed=42 and still got different eval scores on the same 800-prompt.....

#machinelearning #llm #mlops #infrastructure

Using the channels-last memory format reduced the latency of our convolutional backbone by 22%

Elise Moreau • 1d ago • 4 min read

TL;DR: Switching our convolutional segmentation backbone to PyTorch's channels-last memory format cu...

#pytorch #computervision #machinelearning #mlops

Perplexity held flat after INT4. Task accuracy dropped 7 points.

Marcus Chen • 6d ago • 4 min read

TL;DR: We quantized a fine-tuned 14B agent model to INT4 with GPTQ. Perplexity moved 0.04. We almost...

#machinelearning #llm #mlops #pytorch

The seam our tiled upscaler left on every 4K product render

Elise Moreau • 6d ago • 4 min read

TL;DR: We tile high-res images through our upscaler because a full 4096×4096 pass blows past 24GB of...

#mlops #computervision #pytorch #machinelearning

Building a Self-Hosted MLOps Platform from Scratch with FastAPI, PostgreSQL, GCS, and Docker

SHIVAM UPADHYAY • 2d ago • 4 min read

Introduction Over the past few months, I set out to answer a simple question: What does...

#ai #devops #python #mlops

If a 270M Model Already Worked, Why Did I Fine-Tune a 7B One?

Suman Nath • 3d ago • 3 min read

Part 4 (finale) of a 4-part series. Three model sizes tied on the same task — so when does bigger ac...

#machinelearning #llm #mlops #ai

Semantic caching our flaky-test summariser: 58% fewer LLM calls

claire nguyen • 2d ago • 4 min read

TL;DR: Our internal flaky-test summariser at Buildkite was firing ~40k LLM calls a day, and most wer...

#sre #devops #llm #mlops

Machine learning in production: the model is the easy part

Mridul Nagpal • 1d ago • 3 min read

A model that scores 95% on your test set feels like the finish line. Then you ship it, and you find....

#ai #machinelearning #mlops

ML Observability on EKS: Logs, Metrics and Tracing Head-to-Head

Fernando Azevedo • 1d ago • 11 min read

Technical analysis comparing the leading observability strategies for ML workloads on EKS: Fluent Bi...

#dataplatforms #eks #mlops #observability

Deploying MLflow Open-Source Machine Learning Experiment Tracking on Ubuntu 24.04

Sanskriti Harmukh • 1d ago • 3 min read

MLflow is an open-source platform for managing the machine learning lifecycle — experiment tracking,...

#mlops #docker #devops #ai

Stop building custom wrappers for your ML models.

Renato Marinho • 11h ago • 4 min read

I spent three days last month building a specialized API wrapper for a simple Scikit-learn model. No...

#ai #mlops #python #productivity

Portkey Alternative: I Switched Away from Portkey. Here's the Honest Reason Why.

Sahajmeet Kaur • 5d ago • 7 min read

I ran Portkey in production for six months and genuinely liked it — until the acquisition, the cost ...

#ai #llm #devops #mlops

What Is an Agent Gateway? (And Why Our AI Gateway Stopped Being Enough)

Sahajmeet Kaur • 1d ago • 8 min read

We had an AI gateway running fine for six months. Then we added agents. Here's what broke - and why ...

#ai #agents #devops #mlops

Position bias in LLM-as-judge flipped 18% of our verdicts

Marcus Chen • 4h ago • 4 min read

TL;DR: Position bias in LLM-as-judge means the model favors whichever answer it reads first. We...

#machinelearning #llm #mlops #aiengineering

Exploring Midjourney Medical: Revolutionizing Healthcare with AI

Naveen Malothu • 6d ago • 2 min read

Discover the potential of Midjourney Medical in revolutionizing healthcare with AI-generated medical...

#ai #machinelearning #mlops

Scaling Industrial Intelligence: Architectural Patterns from a Machine Learning Development Company

James Sanderson • 3d ago • 3 min read

While traditional software engineering relies on static, rule-based logical structures, modern...

#ai #mlops #machinelearning #deeplearning

Machine Learning Engineer vs Data Scientist: Understanding the Difference

Scott McMahan • 2d ago • 2 min read

Artificial intelligence teams often include both Data Scientists and Machine Learning Engineers,...

#ai #machinelearning #datascience #mlops

Real-time AI: Build Your Live Data Stack for Performance

joseph quesada • 2d ago • 8 min read

Unlock real-time intelligence for your AI campaigns and applications. This post details how to build

#aiinfrastructure #realtimedata #dataengineering #mlops
