Mallow's Blog

On-device text recognition in iOS using Vision

Apple made a lot of improvements in iOS 13 to the Vision framework, which was introduced in iOS 11. One of those improvements is on-device text recognition.

Before this release, you had to use VNDetectTextRectanglesRequest to detect characters in an image and then run a CoreML model to extract meaningful text from the detected characters. You also had to handle all the optimisation, error correction and everything else for the recognised text yourself.
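For context, the pre-iOS 13 flow looked roughly like the sketch below: Vision only reported character bounding boxes, and classifying the text inside them was left to your own model.

```swift
import Vision

// Detect character-level bounding boxes (available since iOS 11).
let request = VNDetectTextRectanglesRequest { request, error in
    guard let observations = request.results as? [VNTextObservation] else { return }
    for observation in observations {
        // Each characterBox is a region you would then feed to
        // your own CoreML character-recognition model.
        for box in observation.characterBoxes ?? [] {
            print(box.boundingBox)
        }
    }
}
request.reportCharacterBoxes = true
```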

From iOS 13, you no longer need to do any of this to extract text from an image. Vision takes care of everything, from extraction to performance optimisation, with little input from your end.

The main advantage of Vision text recognition is that text extraction and processing happen on device. Images (or other sensitive user data) never leave the device, which reduces the risk of data theft or misuse. On-device text recognition is also considerably faster than cloud-based processing using services like Google Vision or AWS Textract.

Vision offers two recognition levels that can be chosen to recognise text in an image: Fast and Accurate. Each has its own pros and cons.


Fast

  • It works based on character detection and is optimised for real-time processing of text
  • Since it doesn't run heavy CoreML models to process images and text, it takes less time to complete
  • Accuracy is lower, as detection is purely character based and NLP is used only to correct the detected text
  • It uses less memory, since it doesn't need to run large neural networks (NN)
  • Apple recommends it for live capture and processing, such as live translators
  • Prefer it when another heavy process is already in progress, such as rendering an AR scene in the foreground


Accurate

  • It uses neural networks (NN) to detect meaningful text in an image
  • It is meant for asynchronous processing, where results are not required in real time
  • Since it runs a neural network, it takes more time than the fast recognition level
  • Results are more accurate even when an image contains different font styles and sizes, rotation or misaligned text
  • Apple recommends it when images are already available in the photo gallery and high accuracy is needed

Choose the recognition level based on your requirement. If you need high accuracy and are committed to meaningful results, go with the accurate recognition level. If the device is memory constrained because your primary feature is consuming it, or you need results fast for real-time action, go with the fast recognition level.
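The level is a single property on the request, so switching between the two is a one-line change. A minimal sketch:

```swift
import Vision

let request = VNRecognizeTextRequest()

// Accurate: NN based, slower, best for static images from the gallery.
request.recognitionLevel = .accurate

// Fast: character-detection based, suited to real-time capture.
request.recognitionLevel = .fast
```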

To get started with VNRecognizeTextRequest, check the code snippet below.

Code Snippet:

func process(image: UIImage) {
    guard let imageData = image.pngData() else { return }
    let requestHandler = VNImageRequestHandler(data: imageData, options: [:])
    let request = VNRecognizeTextRequest { (request, error) in
        guard let observations = request.results as? [VNRecognizedTextObservation] else { return }
        var recognisedTexts = "Recognised texts:\n\n"
        for observation in observations {
            // Take the most confident candidate for each observation.
            let candidates = observation.topCandidates(1)
            for candidate in candidates {
                recognisedTexts.append(candidate.string + "\n")
            }
        }
        DispatchQueue.main.async {
            self.recognisedTexts = recognisedTexts
            self.activityIndicator?.stopAnimating()
            self.activityIndicator?.isHidden = true
        }
    }
    // Showing progress of text recognition from the given image.
    request.progressHandler = { (request, completed, error) in
        DispatchQueue.main.async {
            self.resultLabel.text = "Recognising..\(Int(completed * 100))%"
        }
    }
    request.recognitionLevel = .accurate // Using accurate recognition level to process static scanned images
    request.usesLanguageCorrection = true
    // Minimum text height is a fraction of the image height. Here, text smaller
    // than 10% of the image height will be ignored.
    request.minimumTextHeight = 0.1
    self.activityIndicator?.isHidden = false
    self.activityIndicator?.startAnimating()
    DispatchQueue.global(qos: .userInteractive).async {
        try? requestHandler.perform([request])
    }
}

Points to consider:

  • Whenever possible, crop the image and process only the part you need, so that processing time and memory footprint are reduced.
  • You can turn language correction on or off. Avoid it when dealing with numeric characters; use it for non-numeric text such as alphabets.
  • Use your domain knowledge to eliminate common errors. For example, if you recognise phone numbers, validate the recognised string to cross-check its accuracy and avoid showing a false value to the user.
  • As image quality plays a major role in text recognition, use the document camera controller to scan the document. The document camera controller is the best companion for text recognition.
  • Pass domain-specific custom words to help language correction produce better results.
  • Set a minimum text height to improve performance. Any text smaller than minimumTextHeight is ignored, which reduces processing time.
  • If a heavy process such as rendering an AR scene is running in the foreground, ask text recognition to run only on the CPU to free up the GPU.
  • Use progressHandler to show progress to the user for a better experience.
  • You can cancel an ongoing text recognition request, which lets you offer the user a cancel button to stop a long-running recognition.
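Several of the points above map directly onto properties of VNRecognizeTextRequest. A rough sketch of the relevant knobs (the custom words here are just illustrative placeholders):

```swift
import Vision

let request = VNRecognizeTextRequest()

// Domain-specific vocabulary to assist language correction.
request.customWords = ["Mallow", "VNRecognizeTextRequest"]

// Ignore text smaller than 10% of the image height.
request.minimumTextHeight = 0.1

// Keep the GPU free for other work, e.g. rendering an AR scene.
request.usesCPUOnly = true

// Report progress so the UI can show a percentage.
request.progressHandler = { _, completed, _ in
    print("Recognising \(Int(completed * 100))%")
}

// Cancel a long-running recognition, e.g. from a cancel button.
request.cancel()
```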

That’s all for today. Please check out the Lakshmi Vaults app on iOS 13.0+ devices if you want to see use cases of on-device text recognition in action. The complete code is hosted here.

Karthick S.
iOS Development Team,
Mallow Technologies.

