An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition

Abstract


Image-based sequence recognition has been a longstanding research topic in computer vision. In this paper, we investigate the problem of scene text recognition, which is among the most important and challenging tasks in image-based sequence recognition. A novel neural network architecture, which integrates feature extraction, sequence modeling and transcription into a unified framework, is proposed. Compared with previous systems for scene text recognition, the proposed architecture possesses four distinctive properties: (1) It is end-to-end trainable, in contrast to most of the existing algorithms whose components are separately trained and tuned. (2) It naturally handles sequences of arbitrary lengths, involving no character segmentation or horizontal scale normalization. (3) It is not confined to any predefined lexicon and achieves remarkable performance in both lexicon-free and lexicon-based scene text recognition tasks. (4) It generates an effective yet much smaller model, which is more practical for real-world application scenarios. The experiments on standard benchmarks, including the IIIT-5K, Street View Text and ICDAR datasets, demonstrate the superiority of the proposed algorithm over the prior arts. Moreover, the proposed algorithm performs well in the task of image-based music score recognition, which evidently verifies its generality.



1. Introduction

Recently, the community has seen a strong revival of neural networks, which is mainly stimulated by the great success of deep neural network models, specifically Deep Convolutional Neural Networks (DCNN), in various vision tasks. However, the majority of recent works related to deep neural networks have been devoted to the detection or classification of object categories [12, 25]. In this paper, we are concerned with a classic problem in computer vision: image-based sequence recognition. In the real world, a stable of visual objects, such as scene text, handwriting and musical scores, tend to occur in the form of sequences, not in isolation. Unlike general object recognition, recognizing such sequence-like objects often requires the system to predict a series of object labels, instead of a single label. Therefore, recognition of such objects can be naturally cast as a sequence recognition problem. Another unique property of sequence-like objects is that their lengths may vary drastically. For instance, English words can either consist of 2 characters such as "OK" or 15 characters such as "congratulations". Consequently, the most popular deep models like DCNN [25, 26] cannot be directly applied to sequence prediction, since DCNN models often operate on inputs and outputs with fixed dimensions, and thus are incapable of producing a variable-length label sequence.




Some attempts have been made to address this problem for a specific sequence-like object (e.g. scene text). For example, the algorithms in [35, 8] firstly detect individual characters and then recognize these detected characters with DCNN models, which are trained using labeled character images. Such methods often require training a strong character detector for accurately detecting and cropping each character out from the original word image. Some other approaches (such as [22]) treat scene text recognition as an image classification problem, and assign a class label to each English word (90K words in total). This results in a large trained model with a huge number of classes, which is difficult to generalize to other types of sequence-like objects, such as Chinese texts, musical scores, etc., because the number of basic combinations of such sequences can be greater than 1 million. In summary, current systems based on DCNN cannot be directly used for image-based sequence recognition.




Recurrent neural network (RNN) models, another important branch of the deep neural network family, were mainly designed for handling sequences. One of the advantages of RNN is that it does not need the position of each element in a sequence object image in either training or testing. However, a preprocessing step that converts an input object image into a sequence of image features is usually essential. For example, Graves et al. [16] extract a set of geometrical or image features from handwritten texts, while Su and Lu [33] convert word images into sequential HOG features. The preprocessing step is independent of the subsequent components in the pipeline, thus the existing systems based on RNN cannot be trained and optimized in an end-to-end fashion.




Several conventional scene text recognition methods that are not based on neural networks have also brought insightful ideas and novel representations into this field. For example, Almazán et al. [5] and Rodriguez-Serrano et al. [30] proposed to embed word images and text strings in a common vectorial subspace, so that word recognition is converted into a retrieval problem. Yao et al. [36] and Gordo et al. [14] used mid-level features for scene text recognition. Though they achieved promising performance on standard benchmarks, these methods are generally outperformed by previous algorithms based on neural networks [8, 22], as well as the approach proposed in this paper.




The main contribution of this paper is a novel neural network model, whose network architecture is specifically designed for recognizing sequence-like objects in images. The proposed neural network model is named Convolutional Recurrent Neural Network (CRNN), since it is a combination of DCNN and RNN. For sequence-like objects, CRNN possesses several distinctive advantages over conventional neural network models: 1) It can be directly learned from sequence labels (for instance, words), requiring no detailed annotations (for instance, characters); 2) It has the same property as DCNN of learning informative representations directly from image data, requiring neither hand-crafted features nor preprocessing steps, including binarization/segmentation, component localization, etc.; 3) It has the same property as RNN, being able to produce a sequence of labels; 4) It is unconstrained by the lengths of sequence-like objects, requiring only height normalization in both training and testing phases; 5) It achieves better or highly competitive performance on scene texts (word recognition) than the prior arts [23, 8]; 6) It contains far fewer parameters than a standard DCNN model, consuming less storage space.




2. The Proposed Network Architecture

The network architecture of CRNN, as shown in Fig. 1, consists of three components, including the convolutional layers, the recurrent layers, and a transcription layer, from bottom to top.



At the bottom of CRNN, the convolutional layers automatically extract a feature sequence from each input image. On top of the convolutional network, a recurrent network is built for making predictions for each frame of the feature sequence output by the convolutional layers. The transcription layer at the top of CRNN is adopted to translate the per-frame predictions made by the recurrent layers into a label sequence. Though CRNN is composed of different kinds of network architectures (e.g. CNN and RNN), it can be jointly trained with one loss function.



Figure 1. The network architecture. The architecture consists of three parts: 1) convolutional layers, which extract a feature sequence from the input image; 2) recurrent layers, which predict a label distribution for each frame; 3) transcription layer, which translates the per-frame predictions into the final label sequence.



2.1. Feature Sequence Extraction

In the CRNN model, the component of convolutional layers is constructed by taking the convolutional and max-pooling layers from a standard CNN model (fully-connected layers are removed). Such a component is used to extract a sequential feature representation from an input image. Before being fed into the network, all the images need to be scaled to the same height. Then a sequence of feature vectors is extracted from the feature maps produced by the component of convolutional layers, which is the input for the recurrent layers. Specifically, each feature vector of a feature sequence is generated from left to right on the feature maps by column. This means the i-th feature vector is the concatenation of the i-th columns of all the maps. The width of each column in our settings is fixed to a single pixel.
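As a concrete illustration of this column-wise conversion, the following sketch assumes a PyTorch-style (N, C, H, W) tensor layout and illustrative sizes; neither is prescribed by the paper.

```python
import torch

# Hypothetical feature map produced by the convolutional layers for one image:
# 512 channels, height 1, width 26 (sizes chosen for illustration only).
feature_map = torch.randn(1, 512, 1, 26)                            # (N, C, H, W)

# The i-th feature vector is the concatenation of the i-th columns of all maps,
# so each vector has length C * H (H is 1 at the top of the CRNN conv stack).
n, c, h, w = feature_map.size()
sequence = feature_map.permute(3, 0, 1, 2).reshape(w, n, c * h)     # (T = W, N, C * H)

print(sequence.shape)                                               # torch.Size([26, 1, 512])
```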



As the layers of convolution, max-pooling, and element-wise activation functions operate on local regions, they are translation invariant. Therefore, each column of the feature maps corresponds to a rectangular region of the original image (termed the receptive field), and such rectangular regions are in the same order as their corresponding columns on the feature maps from left to right. As illustrated in Fig. 2, each vector in the feature sequence is associated with a receptive field, and can be considered as the image descriptor for that region.


Figure 2. The receptive field. Each vector in the extracted feature sequence is associated with a receptive field on the input image, and can be considered as the feature vector of that field.



Being robust, rich and trainable, deep convolutional features have been widely adopted for different kinds of visual recognition tasks [25, 12]. Some previous approaches have employed CNN to learn a robust representation for sequence-like objects such as scene text [22]. However, these approaches usually extract a holistic representation of the whole image by CNN, then the local deep features are collected for recognizing each component of a sequence-like object. Since CNN requires the input images to be scaled to a fixed size in order to satisfy its fixed input dimension, it is not appropriate for sequence-like objects due to their large length variation. In CRNN, we convey deep features into sequential representations in order to be invariant to the length variation of sequence-like objects.



2.2. Sequence Labeling

A deep bidirectional Recurrent Neural Network is built on top of the convolutional layers, as the recurrent layers. The recurrent layers predict a label distribution y_t for each frame x_t in the feature sequence x = x_1, ..., x_T. The advantages of the recurrent layers are three-fold. Firstly, RNN has a strong capability of capturing contextual information within a sequence. Using contextual cues for image-based sequence recognition is more stable and helpful than treating each symbol independently. Taking scene text recognition as an example, wide characters may require several successive frames to fully describe (refer to Fig. 2). Besides, some ambiguous characters are easier to distinguish when observing their contexts, e.g. it is easier to recognize "il" by contrasting the character heights than by recognizing each of them separately. Secondly, RNN can back-propagate error differentials to its input, i.e. the convolutional layers, allowing us to jointly train the recurrent layers and the convolutional layers in a unified network. Thirdly, RNN is able to operate on sequences of arbitrary lengths, traversing from start to end.
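A minimal sketch of such stacked bidirectional recurrent layers is given below, assuming PyTorch; the hidden size (256) and the number of output classes (37) are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn

class BidirectionalLSTM(nn.Module):
    """One bidirectional LSTM layer followed by a linear projection,
    mapping each frame of the input sequence to an output vector."""
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.rnn = nn.LSTM(in_dim, hidden_dim, bidirectional=True)
        self.proj = nn.Linear(hidden_dim * 2, out_dim)

    def forward(self, x):                  # x: (T, N, in_dim)
        recurrent, _ = self.rnn(x)         # (T, N, 2 * hidden_dim)
        return self.proj(recurrent)        # (T, N, out_dim)

# Two stacked bidirectional layers, as in Fig. 3.b; sizes are illustrative.
num_classes = 37                           # e.g. 36 characters + 1 'blank'
recurrent_layers = nn.Sequential(
    BidirectionalLSTM(512, 256, 256),
    BidirectionalLSTM(256, 256, num_classes),
)

frames = torch.randn(26, 1, 512)           # feature sequence from the conv layers
logits = recurrent_layers(frames)          # (26, 1, 37): one distribution per frame
```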


Figure 3. (a) The structure of a basic LSTM unit. An LSTM consists of a cell module and three gates, namely the input gate, the output gate and the forget gate. (b) The structure of deep bidirectional LSTM we use in our paper. Combining a forward (left to right) and a backward (right to left) LSTMs results in a bidirectional LSTM. Stacking multiple bidirectional LSTM results in a deep bidirectional LSTM.



A traditional RNN unit has a self-connected hidden layer between its input and output layers. Each time it receives a frame x_t in the sequence, it updates its internal state h_t with a non-linear function that takes both the current input x_t and the past state h_{t-1} as its inputs: h_t = g(x_t, h_{t-1}). Then the prediction y_t is made based on h_t. In this way, past contexts {x_t'} for t' < t are captured and utilized for prediction. Traditional RNN units, however, suffer from the vanishing gradient problem, which limits the range of context they can store and adds burden to the training process. Long Short-Term Memory (LSTM) is a type of RNN unit that is specifically designed to address this problem. An LSTM (illustrated in Fig. 3.a) consists of a memory cell and three multiplicative gates, namely the input, output and forget gates. Conceptually, the memory cell stores past contexts, and the input and output gates allow the cell to store contexts for a long period of time; meanwhile, the memory in the cell can be cleared by the forget gate. This special design allows LSTM to capture long-range dependencies, which often occur in image-based sequences.
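The recurrence h_t = g(x_t, h_{t-1}) can be written out in a few lines; the tanh non-linearity and the randomly initialized weight matrices below are illustrative choices for g, not specifics from the paper.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One step of a vanilla RNN: h_t = g(x_t, h_{t-1})."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

rng = np.random.default_rng(0)
in_dim, hid_dim, T = 512, 256, 26                 # illustrative sizes
W_xh = rng.standard_normal((hid_dim, in_dim)) * 0.01
W_hh = rng.standard_normal((hid_dim, hid_dim)) * 0.01
b_h = np.zeros(hid_dim)

h = np.zeros(hid_dim)
for x_t in rng.standard_normal((T, in_dim)):      # the feature sequence x_1 .. x_T
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)         # h_t depends on all past frames
```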


LSTM is directional; it only uses past contexts. However, in image-based sequences, contexts from both directions are useful and complementary to each other. Therefore, we follow [17] and combine two LSTMs, one forward and one backward, into a bidirectional LSTM. Furthermore, multiple bidirectional LSTMs can be stacked, resulting in a deep bidirectional LSTM as illustrated in Fig. 3.b. The deep structure allows a higher level of abstraction than a shallow one, and has achieved significant performance improvements in the task of speech recognition [17].



In the recurrent layers, error differentials are propagated in the directions opposite to the arrows shown in Fig. 3.b, i.e. by Back-Propagation Through Time (BPTT). At the bottom of the recurrent layers, the sequence of propagated differentials is concatenated into maps, inverting the operation of converting feature maps into feature sequences, and fed back to the convolutional layers. In practice, we create a custom network layer, called "Map-to-Sequence", as the bridge between the convolutional layers and the recurrent layers.



2.3. Transcription

Transcription is the process of converting the per-frame predictions made by the RNN into a label sequence. Mathematically, transcription is to find the label sequence with the highest probability conditioned on the per-frame predictions. In practice, there exist two modes of transcription, namely lexicon-free and lexicon-based transcription. A lexicon is a set of label sequences that the prediction is constrained to, e.g. a spell-checking dictionary. In lexicon-free mode, predictions are made without any lexicon. In lexicon-based mode, predictions are made by choosing the label sequence in the lexicon that has the highest probability.



2.3.1 Probability of label sequence


We adopt the conditional probability defined in the Connectionist Temporal Classification (CTC) layer proposed by Graves et al. [15]. The probability is defined for label sequence l conditioned on the per-frame predictions y =y_1,...,y_T , and it ignores the position where each label in l is located. Consequently, when we use the negative log-likelihood of this probability as the objective to train the network, we only need images and their corresponding label sequences, avoiding the labor of labeling positions of individual characters.


The formulation of the conditional probability is briefly described as follows. The input is a sequence $y = y_1, \dots, y_T$, where $T$ is the sequence length. Here, each $y_t \in \mathbb{R}^{|L'|}$ is a probability distribution over the set $L' = L \cup \{\text{blank}\}$, where $L$ contains all labels in the task (e.g. all English characters) and the extra 'blank' label is denoted by '-'. A sequence-to-sequence mapping function $B$ is defined on sequences $\pi \in L'^T$, where $T$ is the length. $B$ maps $\pi$ onto $l$ by firstly removing the repeated labels, then removing the 'blank's. For example, $B$ maps "--hh-e-l-ll-oo--" ('-' represents 'blank') onto "hello". Then, the conditional probability is defined as the sum of probabilities of all $\pi$ that are mapped by $B$ onto $l$:


$$p(l \mid y) = \sum_{\pi : B(\pi) = l} p(\pi \mid y) \qquad (1)$$

where the probability of $\pi$ is defined as $p(\pi \mid y) = \prod_{t=1}^{T} y_{\pi_t}^{t}$, and $y_{\pi_t}^{t}$ is the probability of having label $\pi_t$ at time stamp $t$. Directly computing Eq. 1 would be computationally infeasible due to the exponentially large number of summation terms. However, Eq. 1 can be efficiently computed using the forward-backward algorithm described in [15].
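To make the mapping B and Eq. 1 concrete, the sketch below enumerates all paths by brute force; this is only feasible for tiny T and toy label sets and is shown purely for illustration, whereas in practice the forward-backward algorithm of [15] is used.

```python
import itertools
import numpy as np

BLANK = 0   # index of the 'blank' label in L'

def B(pi):
    """Collapse repeated labels, then remove blanks: B('--hh-e-l-ll-oo--') -> 'hello'."""
    collapsed = [k for k, _ in itertools.groupby(pi)]
    return tuple(k for k in collapsed if k != BLANK)

def p_l_given_y(l, y):
    """Eq. 1 by brute force: sum p(pi|y) over all paths pi with B(pi) = l."""
    T, num_labels = y.shape
    total = 0.0
    for pi in itertools.product(range(num_labels), repeat=T):
        if B(pi) == tuple(l):
            total += np.prod([y[t, pi[t]] for t in range(T)])
    return total

# Toy per-frame distributions y (T = 4 frames, |L'| = 3 labels: blank, 'a', 'b').
y = np.full((4, 3), 1.0 / 3.0)
print(p_l_given_y([1, 2], y))   # probability of the label sequence "ab"
```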


2.3.2 Lexicon-free transcription

In this mode, the sequence $l^*$ that has the highest probability as defined in Eq. 1 is taken as the prediction. Since there exists no tractable algorithm to precisely find the solution, we use the strategy adopted in [15]: the sequence $l^*$ is approximately found by $l^* \approx B(\arg\max_{\pi} p(\pi \mid y))$, i.e. taking the most probable label $\pi_t$ at each time stamp $t$, and mapping the resulting sequence onto $l^*$.
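A sketch of this best-path (greedy) decoding follows; the per-frame distributions are toy values made up for illustration.

```python
import itertools
import numpy as np

BLANK = 0

def greedy_decode(y):
    """Lexicon-free transcription: take the most probable label at each frame,
    then collapse repeats and remove blanks (l* ~ B(argmax_pi p(pi|y)))."""
    best_path = np.argmax(y, axis=1)                       # most probable label per frame
    collapsed = [k for k, _ in itertools.groupby(best_path)]
    return [int(k) for k in collapsed if k != BLANK]

# Toy per-frame distributions over {blank, 'h', 'e', 'l', 'o'} (indices 0..4).
y = np.array([
    [0.1, 0.8, 0.05, 0.025, 0.025],   # 'h'
    [0.1, 0.1, 0.70, 0.05, 0.05],     # 'e'
    [0.1, 0.1, 0.05, 0.70, 0.05],     # 'l'
    [0.8, 0.05, 0.05, 0.05, 0.05],    # blank separates the two 'l's
    [0.1, 0.1, 0.05, 0.70, 0.05],     # 'l'
    [0.1, 0.1, 0.05, 0.05, 0.70],     # 'o'
])
print(greedy_decode(y))               # [1, 2, 3, 3, 4]  -> "hello"
```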



2.3.3 Lexicon-based transcription

In lexicon-based mode, each test sample is associated with a lexicon $D$. Basically, the label sequence is recognized by choosing the sequence in the lexicon that has the highest conditional probability defined in Eq. 1, i.e. $l^* = \arg\max_{l \in D} p(l \mid y)$. However, for large lexicons, e.g. the 50k-word Hunspell spell-checking dictionary [1], it would be very time-consuming to perform an exhaustive search over the lexicon, i.e. to compute Eq. 1 for all sequences in the lexicon and choose the one with the highest probability. To solve this problem, we observe that the label sequences predicted via lexicon-free transcription, described in 2.3.2, are often close to the ground truth under the edit distance metric. This indicates that we can limit our search to the nearest-neighbor candidates $N_\delta(l')$, where $\delta$ is the maximal edit distance and $l'$ is the sequence transcribed from $y$ in lexicon-free mode:


$$l^* = \arg\max_{l \in N_\delta(l')} p(l \mid y). \qquad (2)$$

The candidates N_δ (l^')can be found efficiently with the BK-tree data structure [9], which is a metric tree specifically adapted to discrete metric spaces. The search time complexity of BK-tree is O(log |D|), where |D| is the lexicon size. Therefore this scheme readily extends to very large lexicons. In our approach, a BK-tree is constructed offline for a lexicon. Then we perform fast online search with the tree, by finding sequences that have less or equal to δ edit distance to the query sequence.
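A minimal BK-tree sketch is shown below; the Levenshtein routine, the tree layout and the toy lexicon are standard illustrative choices, not the paper's C++ implementation. The candidates returned by query would then be re-ranked by p(l|y) as in Eq. 2.

```python
def edit_distance(a, b):
    """Standard Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

class BKTree:
    """BK-tree over a discrete metric (here: edit distance)."""
    def __init__(self, words):
        it = iter(words)
        self.root = (next(it), {})                 # node = (word, {distance: child})
        for w in it:
            self._add(w)

    def _add(self, w):
        node = self.root
        while True:
            d = edit_distance(w, node[0])
            if d in node[1]:
                node = node[1][d]
            else:
                node[1][d] = (w, {})
                return

    def query(self, w, delta):
        """All lexicon entries within edit distance delta of w (the set N_delta)."""
        out, stack = [], [self.root]
        while stack:
            word, children = stack.pop()
            d = edit_distance(w, word)
            if d <= delta:
                out.append(word)
            # triangle inequality: only children whose key k satisfies |d - k| <= delta can match
            stack.extend(child for k, child in children.items()
                         if d - delta <= k <= d + delta)
        return out

lexicon = ["hello", "help", "hell", "shell", "yellow", "world"]   # toy lexicon
tree = BKTree(lexicon)
print(tree.query("helo", 2))   # candidate sequences, to be re-ranked by p(l|y)
```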



2.4. Network Training

Denote the training dataset by $X = \{I_i, l_i\}_i$, where $I_i$ is the training image and $l_i$ is the ground truth label sequence. The objective is to minimize the negative log-likelihood of the conditional probability of the ground truth:

$$\mathcal{O} = -\sum_{(I_i, l_i) \in X} \log p(l_i \mid y_i), \qquad (3)$$

where y_i is the sequence produced by the recurrent and convolutional layers from I_i . This objective function calculates a cost value directly from an image and its ground truth label sequence. Therefore, the network can be end-to-end trained on pairs of images and sequences, eliminating the procedure of manually labeling all individual components in training images.
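In a modern framework, this objective corresponds to a CTC loss over the per-frame predictions. The sketch below uses PyTorch's nn.CTCLoss as a stand-in for the paper's Torch7/C++ transcription layer; the shapes, label indices and sequence lengths are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Log-probabilities produced by the recurrent layers for a batch of 2 images:
# (T, N, |L'|) with T = 26 frames and 37 labels (index 0 reserved for 'blank').
log_probs = torch.randn(26, 2, 37, requires_grad=True).log_softmax(2)

# Ground truth label sequences l_i, concatenated, with their lengths.
targets = torch.tensor([8, 5, 12, 12, 15,    # "hello" encoded as label indices (a=1..z=26)
                        3, 1, 20])           # "cat"
target_lengths = torch.tensor([5, 3])
input_lengths = torch.full((2,), 26, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)                    # negative log-likelihood of Eq. 3, averaged
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                              # error differentials flow back through the network
```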



The network is trained with stochastic gradient descent (SGD). Gradients are calculated by the back-propagation algorithm. In particular, in the transcription layer, error differentials are back-propagated with the forward-backward algorithm, as described in [15]. In the recurrent layers, the Back-Propagation Through Time (BPTT) is applied to calculate the error differentials.



For optimization, we use ADADELTA [37] to automatically calculate per-dimension learning rates. Compared with the conventional momentum [31] method, ADADELTA requires no manual setting of a learning rate. More importantly, we find that optimization using ADADELTA converges faster than the momentum method.



3. Experiments

To evaluate the effectiveness of the proposed CRNN model, we conducted experiments on standard benchmarks for scene text recognition and musical score recognition, which are both challenging vision tasks. The datasets and the settings for training and testing are given in Sec. 3.1, the detailed settings of CRNN for scene text images are provided in Sec. 3.2, and the results with comprehensive comparisons are reported in Sec. 3.3. To further demonstrate the generality of CRNN, we verify the proposed algorithm on a music score recognition task in Sec. 3.4.



3.1. Datasets

For all the experiments on scene text recognition, we use the synthetic dataset (Synth) released by Jaderberg et al. [20] as the training data. The dataset contains 8 million training images and their corresponding ground truth words. Such images are generated by a synthetic text engine and are highly realistic. Our network is trained on the synthetic data once, and tested on all other real-world test datasets without any fine-tuning on their training data. Even though the CRNN model is purely trained with synthetic text data, it works well on real images from standard text recognition benchmarks.



Four popular benchmarks for scene text recognition are used for performance evaluation, namely ICDAR 2003 (IC03), ICDAR 2013 (IC13), IIIT 5k-word (IIIT5k), and Street View Text (SVT).



IC03 [27] test dataset contains 251 scene images with labeled text bounding boxes. Following Wang et al. [34], we ignore images that either contain non-alphanumeric characters or have fewer than three characters, and get a test set with 860 cropped text images. Each test image is associated with a 50-word lexicon defined by Wang et al. [34]. A full lexicon is built by combining all the per-image lexicons. In addition, we use a 50k-word lexicon consisting of the words in the Hunspell spell-checking dictionary [1].


Table 1. Network configuration summary. The first row is the top layer. 'k', 's' and 'p' stand for kernel size, stride and padding size respectively



IC13 [24] test dataset inherits most of its data from IC03. It contains 1,015 ground truths cropped word images.

IIIT5k [28] contains 3,000 cropped word test images collected from the Internet. Each image has been associated to a 50-words lexicon and a 1k-words lexicon.

SVT [34] test dataset consists of 249 street view images collected from Google Street View. From them 647 word images are cropped. Each word image has a 50 words lexicon defined by Wang et al. [34].



3.2. Implementation Details

The network configuration we use in our experiments is summarized in Table 1. The architecture of the convolutional layers is based on the VGG-VeryDeep architectures [32]. A tweak is made in order to make it suitable for recognizing English texts. In the 3rd and the 4th max-pooling layers, we adopt 1 × 2 sized rectangular pooling windows instead of the conventional square ones. This tweak yields feature maps with larger width, hence longer feature sequences. For example, an image containing 10 characters is typically of size 100 × 32, from which a feature sequence of 25 frames can be generated. This length exceeds the lengths of most English words. On top of that, the rectangular pooling windows yield rectangular receptive fields (illustrated in Fig. 2), which are beneficial for recognizing some characters that have narrow shapes, such as 'i' and 'l'.
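The effect of the rectangular window can be checked with a few lines of PyTorch (a sketch only; the exact strides and paddings of Table 1 are not reproduced, and the 1 × 2 window is assumed to mean width × height, i.e. the height is halved while the width is preserved):

```python
import torch
import torch.nn as nn

# Feature map after the first two conv + 2x2 pooling stages of a 32 (h) x 100 (w) input.
x = torch.randn(1, 256, 8, 25)                             # (N, C, H, W)

square = nn.MaxPool2d(kernel_size=2, stride=2)             # conventional square window
rect = nn.MaxPool2d(kernel_size=(2, 1), stride=(2, 1))     # the paper's 1x2 window, assumed w x h

print(square(x).shape)   # torch.Size([1, 256, 4, 12]) -- the width is halved as well
print(rect(x).shape)     # torch.Size([1, 256, 4, 25]) -- width kept, longer feature sequence
```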



The network not only has deep convolutional layers, but also has recurrent layers. Both are known to be hard to train. We find that the batch normalization [19] technique is extremely useful for training networks of such depth. Two batch normalization layers are inserted after the 5th and 6th convolutional layers respectively. With the batch normalization layers, the training process is greatly accelerated.



We implement the network within the Torch7 [10] framework, with custom implementations for the LSTM units (in Torch7/CUDA), the transcription layer (in C++) and the BK-tree data structure (in C++). Experiments are carried out on a workstation with a 2.50 GHz Intel(R) Xeon(R) E5-2609 CPU, 64GB RAM and an NVIDIA(R) Tesla(TM) K40 GPU. Networks are trained with ADADELTA, setting the parameter ρ to 0.9. During training, all images are scaled to 100 × 32 in order to accelerate the training process. The training process takes about 50 hours to reach convergence. Testing images are scaled to have height 32. Widths are proportionally scaled with heights, but at least 100 pixels. The average testing time is 0.16s/sample, as measured on IC03 without a lexicon. The approximate lexicon search is applied to the 50k lexicon of IC03, with the parameter δ set to 3. Testing each sample takes 0.53s on average.



3.3. Comparative Evaluation

All the recognition accuracies on the above four public datasets, obtained by the proposed CRNN model and the recent state-of-the-art techniques including the approaches based on deep models [23, 22, 21], are shown in Table 2.



In the constrained lexicon cases, our method consistently outperforms most state-of-the-art approaches, and on average beats the best text reader proposed in [22]. Specifically, we obtain superior performance on IIIT5k and SVT compared to [22], and only achieve lower performance on IC03 with the "Full" lexicon. Note that the model in [22] is trained on a specific dictionary, namely that each word is associated to a class label. Unlike [22], CRNN is not limited to recognizing a word in a known dictionary, and is able to handle random strings (e.g. telephone numbers), sentences or other scripts such as Chinese words. Therefore, the results of CRNN are competitive on all the testing datasets.



In the unconstrained lexicon cases, our method achieves the best performance on SVT, yet is still behind some approaches [8, 22] on IC03 and IC13. Note that the blanks in the "none" columns of Table 2 denote that such approaches are unable to be applied to recognition without a lexicon, or did not report the recognition accuracies in the unconstrained cases. Our method uses only synthetic text with word-level labels as the training data, very different from PhotoOCR [8], which used 7.9 million real word images with character-level annotations for training. The best performance in the unconstrained lexicon cases is reported by [22], benefiting from its large dictionary; however, it is not a model strictly unconstrained to a lexicon as mentioned before. In this sense, our results in the unconstrained lexicon case are still promising.



For further understanding the advantages of the proposed algorithm over other text recognition approaches, we provide a comprehensive comparison on several properties named E2E Train, Conv Ftrs, CharGT-Free, Unconstrained, and Model Size, as summarized in Table 3.


Table 3. Comparison among various methods. Attributes for comparison include: 1) being end-to-end trainable (E2E Train); 2) using convolutional features that are directly learned from images rather than using hand-crafted ones (Conv Ftrs); 3) requiring no ground truth bounding boxes for characters during training (CharGT-Free); 4) not confined to a pre-defined dictionary (Unconstrained); 5) the model size (if an end-to-end trainable model is used), measured by the number of model parameters (Model Size, M stands for millions).



E2E Train: This column is to show whether a certain text reading model is end-to-end trainable, without any preprocessing or several separated steps, which indicates such approaches are elegant and clean for training. As can be observed from Table 3, only the models based on deep neural networks, including [22, 21] as well as CRNN, have this property.



Conv Ftrs: This column is to indicate whether an approach uses the convolutional features learned from training images directly or hand-crafted features as the basic representations.



CharGT-Free: This column is to indicate whether the character-level annotations are essential for training the model. As the input and output labels of CRNN can be a sequence, character-level annotations are not necessary.



Unconstrained: This column is to indicate whether the trained model is constrained to a specific dictionary, unable to handle out-of-dictionary words or random sequences. Notice that though the recent models learned by label embedding [5, 14] and incremental learning [22] achieved highly competitive performance, they are constrained to a specific dictionary.


Table 2. Recognition accuracies (%) on four datasets. In the second row, "50", "1k", "50k" and "Full" denote the lexicon used, and "None" denotes recognition without a lexicon. (*[22] is not lexicon-free in the strict sense, as its outputs are constrained to a 90k dictionary.)



Model Size: This column is to report the storage space of the learned model. In CRNN, all layers have weight-sharing connections, and the fully-connected layers are not needed. Consequently, the number of parameters of CRNN is much less than that of the models learned on the variants of CNN [22, 21], resulting in a much smaller model compared with [22, 21]. Our model has 8.3 million parameters, taking only 33MB RAM (using a 4-byte single-precision float for each parameter), thus it can be easily ported to mobile devices.


Table 3 clearly shows the differences among different approaches in detail, and fully demonstrates the advantages of CRNN over other competing methods. In addition, to test the impact of the parameter δ, we experiment with different values of δ in Eq. 2. In Fig. 4 we plot the recognition accuracy as a function of δ. Larger δ results in more candidates, thus more accurate lexicon-based transcription. On the other hand, the computational cost grows with larger δ, due to longer BK-tree search time, as well as a larger number of candidate sequences for testing. In practice, we choose δ = 3 as a tradeoff between accuracy and speed.


Figure 4. Blue line graph: recognition accuracy as a function of the parameter δ. Red bars: lexicon search time per sample. Tested on the IC03 dataset with the 50k lexicon.



3.4. Musical Score Recognition

A musical score typically consists of sequences of musical notes arranged on staff lines. Recognizing musical scores in images is known as the Optical Music Recognition (OMR) problem. Previous methods often require image preprocessing (mostly binarization), staff line detection and individual note recognition [29]. We cast OMR as a sequence recognition problem, and predict a sequence of musical notes directly from the image with CRNN. For simplicity, we recognize pitches only, ignore all chords and assume the same major scale (C major) for all scores.



To the best of our knowledge, there exists no public dataset for evaluating algorithms on pitch recognition. To prepare the training data needed by CRNN, we collect 2650 images from [2]. Each image contains a fragment of a score containing 3 to 20 notes. We manually label the ground truth label sequences (sequences of note pitches) for all the images. The collected images are augmented to 265k training samples by being rotated, scaled and corrupted with noise, and by replacing their backgrounds with natural images. For testing, we create three datasets: 1) "Clean", which contains 260 images collected from [2]. Examples are shown in Fig. 5.a; 2) "Synthesized", which is created from "Clean", using the augmentation strategy mentioned above. It contains 200 samples, some of which are shown in Fig. 5.b; 3) "Real-World", which contains 200 images of score fragments taken from music books with a phone camera. Examples are shown in Fig. 5.c.


Figure 5. (a) Clean musical scores images collected from [2] (b) Synthesized musical score images. (c) Real-world score images taken with a mobile phone camera.



Since we have limited training data, we use a simplified CRNN configuration in order to reduce model capacity. Different from the configuration specified in Tab. 1, the 4th and 6th convolution layers are removed, and the 2-layer bidirectional LSTM is replaced by a 2-layer single directional LSTM. The network is trained on the pairs of images and corresponding label sequences. Two measures are used for evaluating the recognition performance: 1) fragment accuracy, i.e. the percentage of score fragments correctly recognized; 2) average edit distance, i.e. the average edit distance between predicted pitch sequences and the ground truths. For comparison, we evaluate two commercial OMR engines, namely the Capella Scan [3] and the PhotoScore [4].


Table 4. Comparison of pitch recognition accuracies, among CRNN and two commercial OMR systems, on the three datasets we have collected. Performances are evaluated by fragment accuracies and average edit distance ("fragment accuracy/average edit distance").



Tab. 4 summarizes the results. The CRNN outperforms the two commercial systems by a large margin. The Capella Scan and PhotoScore systems perform reasonably well on the Clean dataset, but their performances drop significantly on synthesized and real-world data. The main reason is that they rely on robust binarization to detect staff lines and notes, but the binarization step often fails on synthesized and real-world data due to bad lighting conditions, noise corruption and cluttered backgrounds. The CRNN, on the other hand, uses convolutional features that are highly robust to noise and distortions. Besides, the recurrent layers in CRNN can utilize contextual information in the score. Each note is recognized not only by itself, but also by the nearby notes. Consequently, some notes can be recognized by comparing them with the nearby notes, e.g. contrasting their vertical positions.



The results have shown the generality of CRNN, in that it can be readily applied to other image-based sequence recognition problems, requiring minimal domain knowledge. Compared with Capella Scan and PhotoScore, our CRNN-based system is still preliminary and misses many functionalities. But it provides a new scheme for OMR, and has shown promising capabilities in pitch recognition.



4. Conclusion

In this paper, we have presented a novel neural network architecture, called Convolutional Recurrent Neural Network (CRNN), which integrates the advantages of both Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN). CRNN is able to take input images of varying dimensions and produces predictions with different lengths. It directly runs on coarse level labels (e.g. words), requiring no detailed annotations for each individual element (e.g. characters) in the training phase. Moreover, as CRNN abandons fully connected layers used in conventional neural networks, it results in a much more compact and efficient model. All these properties make CRNN an excellent approach for image-based sequence recognition.



The experiments on the scene text recognition benchmarks demonstrate that CRNN achieves superior or highly competitive performance, compared with conventional methods as well as other CNN and RNN based algorithms. This confirms the advantages of the proposed algorithm. In addition, CRNN significantly outperforms other competitors on a benchmark for Optical Music Recognition (OMR), which verifies the generality of CRNN.



Actually, CRNN is a general framework, thus it can be applied to other domains and problems (such as Chinese character recognition), which involve sequence prediction in images. To further speed up CRNN and make it more practical in real-world applications is another direction that is worthy of exploration in the future.


Original paper: An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition (arXiv:1507.05717)

"


分享到:


相關文章: