OpenCV 5 Is Here: The Biggest Leap in Years for Computer Vision

(opencv.org)

352 points | by ternaus 3 days ago

15 comments

plasticeagle 2 hours ago
The thing I love about OpenCV is that it remains hands down the best library for simply loading images and video. I've never even used any of its fancy computer vision features, but if I need to load a video file and look at the pixels - which I did need to do recently for an art project - OpenCV does it in about four lines of code.
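For reference, the "load a video and look at the pixels" workflow can be sketched roughly as below. This is a minimal sketch, not the commenter's code: `iter_frames` and `print_first_pixels` are hypothetical helper names, and the only OpenCV calls assumed are `cv2.VideoCapture`, its `read()` method, and `release()`.

```python
def iter_frames(cap):
    """Yield frames from any object exposing a cv2.VideoCapture-style read()."""
    while True:
        ok, frame = cap.read()  # read() returns (success_flag, frame)
        if not ok:              # False once the stream ends or a read fails
            break
        yield frame


def print_first_pixels(path):
    """Open a video with OpenCV and print each frame's shape and top-left pixel."""
    import cv2  # imported here so iter_frames stays usable without OpenCV

    cap = cv2.VideoCapture(path)
    try:
        for frame in iter_frames(cap):
            # frame is a NumPy array of shape (height, width, 3) in BGR order
            print(frame.shape, frame[0, 0])
    finally:
        cap.release()
```

Calling `print_first_pixels("input.mp4")` (the path is a placeholder) walks the whole file. Keeping the loop in a generator means the same code works on a webcam index or a test double.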
ftchd 5 hours ago
> One practical detail is worth knowing. The new engine is CPU-only at the moment, so if you select a non-CPU backend and target (for example CUDA or OpenVINO through setPreferableBackend and setPreferableTarget), you will want the classic engine.
So there's room for even better performance!
- wongarsu 5 hours ago
  It's certainly a choice to make your headline feature a new ONNX engine, feature a bunch of comparisons of how it's better than ONNXRuntime, while casually mentioning on the side that the cool new much faster engine is CPU-only.
  Sure, running models on the CPU is very much a thing in computer vision (the benchmarked YOLOv8n has only about 3.2M params). But this whole announcement feels more like OpenCV catching up to the modern world, not "The Biggest Leap in Years for Computer Vision".
  Still great, needing fewer libraries is a good thing, but maybe a bit oversold
  - VadimPR 4 hours ago
    The release post is AI-written with little human oversight and it shows.
    - claytongulick 2 hours ago
      I had to stop reading after: "This is not just another incremental release. OpenCV 5 is a major step forward."
      If a human can't be bothered to write a piece, I can't be bothered to read it.
      - danjc 11 minutes ago
        It's not just annoying, it's tiring
      - VulgarExigency 1 hour ago
        The endless deluge of AI prose really wears on the soul once you start noticing it.
      - thin_carapace 24 minutes ago
        i initially adopted this line of thinking. after exposure to arguably valid cases like translated articles, it now seems to me that the most efficient path forward (after first noting AI prose) is to scan past all language and evaluate whether or not useful content is encoded within. there's no benefit to anyone (except those benefitting from societal atrophy) in wasting brain cycles on unnecessary verbosity; however, blanket rejection necessarily involves loss of valuable information.
    - trklausss 1 hour ago
      This is what I hate about AI. Not that people use it; it's great for accelerating specific workflows, making fewer mistakes, etc. It's blindly trusting it, just saying "Make a post about a CV library release, make no mistakes", and calling it a day.
      Where has the human creativity in writing release notes gone?
    - vdfs 2 hours ago
      The illustrations couldn't be any more generic-ai
- nnevatie 4 hours ago
  No one uses ONNXRuntime (nor the new engine in OpenCV 5) in production. For anything performance-sensitive, one would run models under TensorRT, as an example.
  - amorroxic 4 hours ago
    Curious what backs this assertion. As a counterpoint, we’ve been running 200+ models in production for more than 5 years: language models, embeddings, classifiers, from the low tens to a hundred M params. Traffic is on the order of 1-2M requests/day, and everything is enabled by ONNX with some cgo (or Rust) plumbing on top. What’s your SLA?
  - snovv_crash 4 hours ago
    Strong statement to make when I have at least 2 datapoints contradicting it, in SaaS and embedded/robotics.
  - monster_truck 43 minutes ago
    I've never understood how anyone comes into contact with it and thinks it's anything more than an incredible inconvenience masked as the easy way of doing things. I've given it a few good shakes for various uses and regretted the time spent each time.
  - pzo 2 hours ago
    How are you supposed to use TensorRT on iOS, iPadOS, Android, or even the web? Production is not only cloud.
  - antonvs 1 hour ago
    We use this in production:
    https://docs.rs/onnxruntime/latest/onnxruntime/
    It’s a Rust wrapper around ONNX Runtime. We currently serve 5+ million inference requests per day for a highly performance-sensitive application, for a long list of major enterprise clients. We don’t use GPUs for inference, because it would be cost-prohibitive. We launch tens of thousands of VMs per day to run these workloads.
  - gunalx 4 hours ago
    Production doesn't have to be performance-sensitive, so devex may still outweigh the performance differences in some scenarios.
  - OvervCW 3 hours ago
    You can use ONNXRuntime with a TensorRT backend, so one does not exclude the other.
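To put the traffic figures quoted in this sub-thread (1-2M and 5+ million requests per day) on a common scale, here is a small arithmetic sketch converting them to average requests per second. It assumes perfectly uniform load, which real traffic rarely has, so peak rates would be higher; `average_rps` is a hypothetical helper name.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def average_rps(requests_per_day: float) -> float:
    """Average requests per second, assuming load is spread evenly over the day."""
    return requests_per_day / SECONDS_PER_DAY


for label, per_day in [("1M/day", 1_000_000), ("2M/day", 2_000_000), ("5M/day", 5_000_000)]:
    print(f"{label}: {average_rps(per_day):.1f} requests/s on average")
# 1M/day: 11.6 requests/s on average
# 2M/day: 23.1 requests/s on average
# 5M/day: 57.9 requests/s on average
```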
pzo 1 hour ago
Quite a good release, although I'm not sure why they invest so much time in their own ONNX engine. I don't think they have enough staff or deep enough pockets to compete with ONNXRuntime, CoreAI, ExecuTorch, LiteRT.
I'm happy they added an option for ONNXRuntime. I wish their cv.dnn was mostly a unified wrapper around many different backends (ONNXRuntime, ExecuTorch, LiteRT, CoreAI) plus maybe some tooling around it (performance metrics tools, model downloads, etc.). The Transformers(.js) approach looks better to me.
I wish they had also invested more time in better production-ready camera I/O (for mobile, device/format discovery, manual settings, depth map support, etc.) and a better HighGUI that could use different backends (Skia, WebGPU) and run on mobile.
GreenSalem 1 hour ago
AI written release post and it shows...
- oceansky 22 minutes ago
  I can't say for sure, but there is a suspicious amount of "it's not x, it's y". At least there are no em-dashes.
- _qua 13 minutes ago
  The diagrams definitely look like LLM output as well
arcanine 4 hours ago
They really improved the performance. I tested a YOLOv8 medium segmentation model on an Intel i7 11th-gen CPU with the same code:
OpenCV 4.11: ~255 ms
OpenCV 5.0.0: ~185 ms
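As a quick worked calculation, the two timings reported above (~255 ms and ~185 ms) translate into a speedup factor and a percentage reduction as follows. The numbers are the commenter's approximate measurements; `compare_latency` is a hypothetical helper name.

```python
def compare_latency(old_ms: float, new_ms: float) -> tuple[float, float]:
    """Return (speedup factor, percent reduction in latency)."""
    speedup = old_ms / new_ms
    reduction_pct = (old_ms - new_ms) / old_ms * 100.0
    return speedup, reduction_pct


speedup, reduction = compare_latency(255.0, 185.0)
print(f"{speedup:.2f}x faster, {reduction:.1f}% less time per inference")
# 1.38x faster, 27.5% less time per inference
```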
shelled 3 hours ago
A few years ago I was using OpenCV in a commercial Android SDK (it might still be in use there; we needed it because iOS provided almost all of those "needs" ready-made and Android just didn't, and neither did Firebase or the Jetpack suites/tools). I was the one who added it to the SDK. There was a lot I/we could do, but as an Android developer (with barely any exposure to CV or even C/C++), what I felt we lacked was documentation and a community. We struggled even with shaving off the parts that we did not want to ship with our SDK. Speed was such an issue. The problem was that, for someone who just wanted to use the lib (on mobile), a lot of things felt esoteric and out of reach, i.e. difficult. It didn't have to be. Sadly, LLMs weren't at full speed back then: barely usable, not even talked about. Something like this would have been a perfect use case for AI/LLMs: a coder, not from the specific field the tool was made in, being able to take full advantage of its capabilities in a nuanced/selective manner.
hbcondo714 2 days ago
> LLMs and VLMs, Running Inside OpenCV…Qwen 2.5, Gemma 3, PaliGemma, and the GPT-2 / GPT-4 family
Why these specific models / versions?
maelito 4 hours ago
Can it detect the speed of the car without any hand-made measurement?
- MaxikCZ 3 hours ago
  In pixels/second? Sure!
- monster_truck 42 minutes ago
  Do you know the focal length/AOV of your webcam?
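The two replies above point at the same thing: tracking gives a speed in pixels per second, and turning that into a physical speed needs camera geometry. A minimal sketch under a simple pinhole-camera assumption follows. It assumes the car moves parallel to the image plane at a known distance, and every number in it (resolution, field of view, distance, pixel speed) is made up for illustration.

```python
import math


def focal_length_px(image_width_px: int, horizontal_fov_deg: float) -> float:
    """Focal length in pixels from image width and horizontal angle of view."""
    return (image_width_px / 2.0) / math.tan(math.radians(horizontal_fov_deg) / 2.0)


def speed_m_per_s(pixels_per_second: float, distance_m: float, focal_px: float) -> float:
    """Pinhole model: a shift of p pixels at depth Z spans p * Z / f metres."""
    return pixels_per_second * distance_m / focal_px


f_px = focal_length_px(1920, 90.0)        # hypothetical 1920 px wide, 90 degree camera
speed = speed_m_per_s(480.0, 20.0, f_px)  # hypothetical 480 px/s at 20 m distance
print(f"{speed:.1f} m/s ({speed * 3.6:.0f} km/h)")
# 10.0 m/s (36 km/h)
```

Without the distance and the focal length (or a reference object of known size in the scene), only the pixels-per-second figure is recoverable.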
oliveiracwb 4 hours ago
Computer vision was the formative school for many autodidacts. Although I acquired substantial knowledge from articles translated via Power Translator and Babylon (whose outputs closely mirror those of any 2-million-parameter SLM), it was OpenCV that made concepts like convolutions, softmax, minmax, and others finally click for me. I have consistently viewed OpenCV as an intrinsically open, educational, and adaptable library. Any developer can dissect its codebase to extract a specific filter or algorithmic implementation and tailor it to their requirements. It is certainly not cruising at the velocity of trillion-dollar capital. But it holds its altitude. And it will always be there.
leoncos 3 days ago
When I use Codex/Claude to complete a computer vision task, such as extracting assets from an image, OpenCV is their default solution. However, I believe that using YOLO and other methods is outdated. The best solution now is to directly use Nano Banana or other AI image models. A paper has proven that image generation models can perform most CV tasks well. I believe the new OpenCV should become a wrapper for VLM or AI image models.
- nicolailolansen 5 hours ago
  Once you can run a model like Nano Banana or another vision LLM within the same compute and time constraints as an OpenCV or YOLO call, you can make that comparison. Until then, calling YOLO and OpenCV outdated is simply wrong. There's a time and place for big V-LLMs just as there is a time and place for more "traditional" computer vision methods.
- wongarsu 5 hours ago
  I can get great results from a YOLO model with 30M to maybe 300M params. To get decent CV from an LLM, 8B params is the absolute minimum, and closer to 30B for interesting tasks.
  I might be on board with LLMs being the future of OCR (though many would disagree), but for general CV they are very inefficient for very limited benefit.
  - IanCal 4 hours ago
    They can however be extremely useful for curating training data. The same goes for things like SAM and the DINO (/Grounding DINO) models.
    Also, if they are better, you can have a flow where a cheap model runs first and marginal cases go to a more complex one (and a chain of these).
    The YOLO models are really shockingly good for their cost, and for how well they work without much training data.
  - charcircuit 1 hour ago
    >for very limited benefit
    Due to how simple they are to work with they will become popular. Compare NLP before and after GPT-3. GPT-3 majorly brought down the complexity and skill needed for doing NLP tasks even if traditional NLP is much much faster. Ultimately ease of development will win out and the industry will work towards optimizing running such LLMs to make it cheap enough to run.
- regularfry 5 hours ago
  I've built hardware with a pi zero 2 + pi cam running a mildly fine-tuned YOLO doing local-only object detection as a USB-OTG device, in a use case where any off-device API calls would have been totally unacceptable, and where the object detection was part of the human interaction loop with a hard ceiling of 300ms on the total interaction time of which the object detection was only one process among many.
  We're not going to fit Nano Banana or anything like it on a device with 512MB RAM and a GPU old enough to be irrelevant, and again, API calls just aren't on the menu.
  - Hendrikto 1 hour ago
    > API calls just aren't on the menu
    Even if they were an option, your 300ms latency requirement would exclude them anyway.
- mirsadm 5 hours ago
  That is a very uninformed view. Real time CV is not going to be doing that anytime soon.
- sebmellen 4 hours ago
  Great, let me know when those models can run on-server and process/analyze streams of ID images with less than 100ms of latency. You’ll need to make sure you have a massive set of training data including all manner of slightly blurred and slightly distorted ID cards
  - _the_inflator 1 hour ago
    Exactly, and all on an embedded system with quite restrictive settings, not an overclocked latest-generation Intel combined with NVIDIA's 10k graphics cards.
    - charcircuit 1 hour ago
      Embedded systems can make network calls to powerful, GPU-equipped servers.
      - ceejayoz 6 minutes ago
        Sure. Claude does that. "Cogitated for 1m 50s" doesn't work for real-time applications.
      - Chu4eeno 6 minutes ago
        They really shouldn't, though.
- serf 2 days ago
  do you realize how many edge or unconnected nodes do OpenCV work?
  some SBC w/ an industrial camera that is doing pick-place or go/no-go operations on a conveyor belt against a singular object type doesn't need a huge image-gen/llm model governing it.
  I mean, have you even considered the kind of performance an opencv function can get w/ just mask-matching? even with a fancy YOLO model these answers get thrown out in 1.5-50ms; this is just a wholly different time scale.
- Qhemlomo 3 hours ago
  Processing 100,000 pictures takes a lot of time with LLMs.
  It's a lot better, faster, and cheaper to use LLMs for initial labeling together with hand fine-tuning, and then to train YOLO on the result.
  Training YOLO takes a few hours, and the resulting model is then very fast.
- _the_inflator 1 hour ago
  "When I use..."
  Dude, in business we think in terms of large numbers; internationally, that easily means processing images billions of times. This wouldn't cut it.
  Also, do you buy the mega-expensive, individually designed shoes from the best shoemaker there is to march through some dirt, or do you simply stick to gumboots?
  OpenCV is used behind the scenes for much of the fancy stuff those major AI providers pretend to do. Claude is a huge system and not an LLM anymore.
- kryptiskt 5 hours ago
  If I want to identify and measure the size of round things in my orange sorter machine, I shouldn't have to resort to an unnecessarily complicated solution just because some AI bros can't understand that not everything needs to be an AI model.
  Like, the AI model tools already exist; all that would be accomplished if OpenCV pivoted would be to take it away from people who want to do low-level vision programming. It wouldn't add anything useful to the world, just destroy an excellent library.
- TZubiri 5 hours ago
  I am confused, how can functions that output images help with functions that should take images as input?
  - taneq 4 hours ago
    They’re multimodal LLMs trained for image generation. Turns out that if you want to generate images you gotta know what things look like.
    - TZubiri 3 hours ago
      That's not helpful my brother. If you have details share them, if not, don't pretend you are more illuminated than me.
      Is the image(text) function reversible? Or are they brute force searching a nearest neighbor like word2vec/hash brute forcing.
      - sorenjan 2 hours ago
        Google recently released their paper "Image Generators are Generalist Vision Learners" about exactly this. They fine-tuned Nano Banana Pro into what they call Vision Banana, which can do segmentation etc.
        https://arxiv.org/abs/2604.20329
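One idea from this thread, running a cheap model first and escalating only the marginal cases to a more expensive one, can be sketched as below. This is an illustrative sketch, not anyone's production code: the `cascade` function, the 0.8 threshold, and both toy models are hypothetical stand-ins for, say, a small YOLO model and a large vision LLM.

```python
from typing import Callable, Tuple

Prediction = Tuple[str, float]  # (label, confidence in [0, 1])
Model = Callable[[str], Prediction]


def cascade(image: str, cheap_model: Model, expensive_model: Model,
            threshold: float = 0.8) -> Tuple[str, float, str]:
    """Use the cheap model's answer unless its confidence is below threshold."""
    label, confidence = cheap_model(image)
    if confidence >= threshold:
        return label, confidence, "cheap"
    # Marginal case: pay for the slower, larger model.
    label, confidence = expensive_model(image)
    return label, confidence, "expensive"


def toy_cheap_model(image: str) -> Prediction:
    # Stand-in: pretends to be unsure about anything tagged "blurry".
    return ("apple", 0.55) if "blurry" in image else ("apple", 0.95)


def toy_expensive_model(image: str) -> Prediction:
    # Stand-in for a large model that is slow but confident.
    return ("pear", 0.90)


for image in ["sharp_001", "blurry_002"]:
    print(image, cascade(image, toy_cheap_model, toy_expensive_model))
```

The threshold controls the cost trade-off: raising it sends more images to the expensive model, lowering it trusts the cheap one more often.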
globalnode 5 hours ago
does this mean i'm actually able to try object detection in opencv now? i mean i know basic image processing techniques, and i know "in theory" how ML works but i've never really seen a case where i can just say "here's an image now detect all the apples". there's always 1. find a model that has the knowledge, 2. hook it up to an inference engine, 3. do something useful. i always get stuck at 1.
- wongarsu 5 hours ago
  YOLO has basically solved that for my use cases for a couple years now. If you want labels that are not in the pretrained labels it's also easy to fine-tune, provided you're willing to label 200 or so images
  If you need something less restricted to existing labels (say wanting all the red apples, or all cardboard signs) SAM3 is great, as the sibling comment says
  - IanCal 4 hours ago
    > provided you're willing to label 200 or so images
    A quick note to say that this is also a task you can hand to things like gemini.
- fnands 5 hours ago
  That seems to be the way things are going.
  Large general models have taken over in NLP, and (outside of embedded/low latency applications) it seems like they are coming for CV next.
  So you should soon be able to have a large generic model that can detect whatever for you.
  It's already pretty much possible with open-vocabulary detectors like SAM3, where you could just prompt it with "Apple": https://ai.meta.com/research/sam3/
- shenberg 5 hours ago
  moondream is a beast
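On the "label 200 or so images" step mentioned above: YOLO-style training data commonly stores one text line per object, in the form `class x_center y_center width height`, with coordinates normalized to the image size. A minimal sketch of converting a pixel bounding box into such a line follows. The helper name `to_yolo_label` and the example box are hypothetical, and the exact format expected by a given training tool should be checked against its documentation.

```python
def to_yolo_label(class_id: int, box: tuple, image_size: tuple) -> str:
    """Convert a pixel box (x_min, y_min, x_max, y_max) to a normalized label line."""
    x_min, y_min, x_max, y_max = box
    img_w, img_h = image_size
    x_center = (x_min + x_max) / 2.0 / img_w
    y_center = (y_min + y_max) / 2.0 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# Hypothetical apple (class 0) occupying the middle of a 640x480 image.
print(to_yolo_label(0, (160, 120, 480, 360), (640, 480)))
# 0 0.500000 0.500000 0.500000 0.500000
```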
charankilari 3 hours ago
wow it's been ages
Magnets 2 hours ago
The announcement itself is pure AI slop
- thunky 54 minutes ago
  What about the post was not up to your standards?