About the Company

We are hiring a Computer Vision Engineer to build the visual understanding layer for an AI-native product.

About the Role

You will work on systems that interpret live visual input, understand what is happening in real time, and turn messy visual data into reliable product context. This could include screen understanding, object and UI detection, OCR, tracking, segmentation, visual embeddings, video understanding, and multimodal reasoning. This is a high-ownership engineering role for someone who enjoys taking ambiguous product needs, choosing the right technical approach, and shipping production systems that are fast, reliable and useful.

Responsibilities

Build real-time or near-real-time computer vision pipelines for live visual input.
Detect and interpret objects, UI states, entities, scenes, changes and other relevant visual signals.
Develop tracking and temporal reasoning systems that understand what changes frame-to-frame.
Evaluate and combine OCR, object detection, segmentation, visual embeddings, VLMs, classical CV and custom models.
Optimise inference for latency, throughput, model size, GPU memory and production reliability.
Design clean APIs and event streams that expose visual signals to product, reasoning, retrieval or automation systems.
Create evaluation loops, confidence thresholds and fallback behaviours for uncertain visual outputs.
Work closely with product and engineering teams to turn prototype models into robust user-facing systems.
Help shape the early architecture for a vision system that can scale across use cases and environments.

Qualifications

Strong hands-on experience building computer vision systems in production or production-like environments.
Experience with real-time or low-latency visual processing.
Strong Python skills and experience with at least one performance-oriented stack such as C++, CUDA, TensorRT, OpenCV, ONNX Runtime, OpenVINO, Metal or similar.
Experience with one or more of: object detection, segmentation, OCR, visual embeddings, tracking, scene understanding, video understanding, SLAM, 3D/spatial computing or multimodal/VLM systems.
Strong judgement around latency, throughput, batching, model size, GPU memory, confidence scoring and runtime behaviour.
Ability to prototype quickly, measure performance, improve systematically and ship.
Comfort working in ambiguous product environments where the right technical approach may need to be discovered.
Pragmatic engineering instincts: you care about model quality, but also about whether the system works reliably for users.

Preferred Skills

Experience with screen capture, streaming video, WebRTC, media pipelines or low-latency desktop/mobile applications.
Experience deploying CV models on constrained hardware, edge devices, mobile GPUs/NPUs or real-time production systems.
Experience with VLMs, CLIP-style embeddings, multimodal retrieval, RAG, knowledge graphs or agentic systems.
Experience in gaming, robotics, autonomy, AR/VR, industrial automation, healthcare imaging, security, sports analytics or another vision-heavy domain.
Open-source work, research, demos or side projects showing strong visual and technical taste.

Location and Working Arrangement

Location: London preferred, with flexibility depending on the team setup.
Working arrangement: Hybrid or on-site for close product and engineering collaboration.
Compensation: £80,000-£150,000 plus equity.
Visa sponsorship: Available where applicable.

Pay range and compensation package

£80,000-£150,000 plus equity.

Equal Opportunity Statement

We are committed to diversity and inclusivity.
