An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition

Abstract


Image-based sequence recognition has been a longstanding research topic in computer vision. In this paper, we investigate the problem of scene text recognition, which is among the most important and challenging tasks in image-based sequence recognition. A novel neural network architecture, which integrates feature extraction, sequence modeling and transcription into a unified framework, is proposed. Compared with previous systems for scene text recognition, the proposed architecture possesses four distinctive properties: (1) It is end-to-end trainable, in contrast to most of the existing algorithms whose components are separately trained and tuned. (2) It naturally handles sequences of arbitrary lengths, involving no character segmentation or horizontal scale normalization. (3) It is not confined to any predefined lexicon and achieves remarkable performance in both lexicon-free and lexicon-based scene text recognition tasks. (4) It generates an effective yet much smaller model, which is more practical for real-world application scenarios. The experiments on standard benchmarks, including the IIIT-5K, Street View Text and ICDAR datasets, demonstrate the superiority of the proposed algorithm over the prior arts. Moreover, the proposed algorithm performs well in the task of image-based music score recognition, which evidently verifies its generality.



1. Introduction

Recently, the community has seen a strong revival of neural networks, which is mainly stimulated by the great success of deep neural network models, specifically Deep Convolutional Neural Networks (DCNN), in various vision tasks. However, the majority of recent works related to deep neural networks have been devoted to the detection or classification of object categories [12, 25]. In this paper, we are concerned with a classic problem in computer vision: image-based sequence recognition. In the real world, a stable of visual objects, such as scene text, handwriting and musical scores, tend to occur in the form of sequences, not in isolation. Unlike general object recognition, recognizing such sequence-like objects often requires the system to predict a series of object labels, instead of a single label. Therefore, recognition of such objects can be naturally cast as a sequence recognition problem. Another unique property of sequence-like objects is that their lengths may vary drastically. For instance, English words can either consist of 2 characters such as "OK" or 15 characters such as "congratulations". Consequently, the most popular deep models like DCNN [25, 26] cannot be directly applied to sequence prediction, since DCNN models often operate on inputs and outputs with fixed dimensions, and thus are incapable of producing a variable-length label sequence.




Some attempts have been made to address this problem for a specific sequence-like object (e.g. scene text). For example, the algorithms in [35, 8] firstly detect individual characters and then recognize these detected characters with DCNN models, which are trained using labeled character images. Such methods often require training a strong character detector for accurately detecting and cropping each character out from the original word image. Some other approaches (such as [22]) treat scene text recognition as an image classification problem, and assign a class label to each English word (90K words in total). This results in a large trained model with a huge number of classes, which is difficult to generalize to other types of sequence-like objects, such as Chinese texts, musical scores, etc., because the number of basic combinations of such sequences can be greater than 1 million. In summary, current systems based on DCNN cannot be directly used for image-based sequence recognition.




Recurrent neural network (RNN) models, another important branch of the deep neural network family, were mainly designed for handling sequences. One of the advantages of RNN is that it does not need the position of each element in a sequence object image in either training or testing. However, a preprocessing step that converts an input object image into a sequence of image features is usually essential. For example, Graves et al. [16] extract a set of geometrical or image features from handwritten texts, while Su and Lu [33] convert word images into sequential HOG features. The preprocessing step is independent of the subsequent components in the pipeline, thus the existing systems based on RNN cannot be trained and optimized in an end-to-end fashion.




Several conventional scene text recognition methods that are not based on neural networks have also brought insightful ideas and novel representations into this field. For example, Almazán et al. [5] and Rodriguez-Serrano et al. [30] proposed to embed word images and text strings in a common vectorial subspace, so that word recognition is converted into a retrieval problem. Yao et al. [36] and Gordo et al. [14] used mid-level features for scene text recognition. Though they achieved promising performance on standard benchmarks, these methods are generally outperformed by previous algorithms based on neural networks [8, 22], as well as the approach proposed in this paper.




The main contribution of this paper is a novel neural network model, whose network architecture is specifically designed for recognizing sequence-like objects in images. The proposed neural network model is named Convolutional Recurrent Neural Network (CRNN), since it is a combination of DCNN and RNN. For sequence-like objects, CRNN possesses several distinctive advantages over conventional neural network models: 1) It can be directly learned from sequence labels (for instance, words), requiring no detailed annotations (for instance, characters); 2) It has the same property as DCNN of learning informative representations directly from image data, requiring neither hand-crafted features nor preprocessing steps, including binarization/segmentation, component localization, etc.; 3) It has the same property as RNN, being able to produce a sequence of labels; 4) It is unconstrained by the lengths of sequence-like objects, requiring only height normalization in both training and testing phases; 5) It achieves better or highly competitive performance on scene texts (word recognition) than the prior arts [23, 8]; 6) It contains far fewer parameters than a standard DCNN model, consuming less storage space.




2. The Proposed Network Architecture

The network architecture of CRNN, as shown in Fig. 1, consists of three components, including the convolutional layers, the recurrent layers, and a transcription layer, from bottom to top.



At the bottom of CRNN, the convolutional layers automatically extract a feature sequence from each input image. On top of the convolutional network, a recurrent network is built for making predictions for each frame of the feature sequence output by the convolutional layers. The transcription layer at the top of CRNN is adopted to translate the per-frame predictions made by the recurrent layers into a label sequence. Though CRNN is composed of different kinds of network architectures (e.g. CNN and RNN), it can be jointly trained with one loss function.



Figure 1. The network architecture. The architecture consists of three parts: 1) convolutional layers, which extract a feature sequence from the input image; 2) recurrent layers, which predict a label distribution for each frame; 3) transcription layer, which translates the per-frame predictions into the final label sequence.



2.1. Feature Sequence Extraction

In the CRNN model, the component of convolutional layers is constructed by taking the convolutional and max-pooling layers from a standard CNN model (fully-connected layers are removed). Such a component is used to extract a sequential feature representation from an input image. Before being fed into the network, all the images need to be scaled to the same height. Then a sequence of feature vectors is extracted from the feature maps produced by the component of convolutional layers, which is the input for the recurrent layers. Specifically, each feature vector of a feature sequence is generated from left to right on the feature maps by column. This means the i-th feature vector is the concatenation of the i-th columns of all the maps. The width of each column in our settings is fixed to a single pixel.
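As a concrete illustration of this column-wise conversion, the following sketch assumes a PyTorch-style (N, C, H, W) tensor layout and illustrative sizes; neither is prescribed by the paper.

```python
import torch

# Hypothetical feature map produced by the convolutional layers for one image:
# 512 channels, height 1, width 26 (sizes chosen for illustration only).
feature_map = torch.randn(1, 512, 1, 26)                            # (N, C, H, W)

# The i-th feature vector is the concatenation of the i-th columns of all maps,
# so each vector has length C * H (H is 1 at the top of the CRNN conv stack).
n, c, h, w = feature_map.size()
sequence = feature_map.permute(3, 0, 1, 2).reshape(w, n, c * h)     # (T = W, N, C * H)

print(sequence.shape)                                               # torch.Size([26, 1, 512])
```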



As the layers of convolution, max-pooling, and element-wise activation functions operate on local regions, they are translation invariant. Therefore, each column of the feature maps corresponds to a rectangular region of the original image (termed the receptive field), and such rectangular regions are in the same order as their corresponding columns on the feature maps from left to right. As illustrated in Fig. 2, each vector in the feature sequence is associated with a receptive field, and can be considered as the image descriptor for that region.


Figure 2. The receptive field. Each vector in the extracted feature sequence is associated with a receptive field on the input image, and can be considered as the feature vector of that field.



Being robust, rich and trainable, deep convolutional features have been widely adopted for different kinds of visual recognition tasks [25, 12]. Some previous approaches have employed CNN to learn a robust representation for sequence-like objects such as scene text [22]. However, these approaches usually extract a holistic representation of the whole image by CNN, then the local deep features are collected for recognizing each component of a sequence-like object. Since CNN requires the input images to be scaled to a fixed size in order to satisfy its fixed input dimension, it is not appropriate for sequence-like objects due to their large length variation. In CRNN, we convey deep features into sequential representations in order to be invariant to the length variation of sequence-like objects.



2.2. Sequence Labeling

A deep bidirectional Recurrent Neural Network is built on top of the convolutional layers, as the recurrent layers. The recurrent layers predict a label distribution y_t for each frame x_t in the feature sequence x = x_1, ..., x_T. The advantages of the recurrent layers are three-fold. Firstly, RNN has a strong capability of capturing contextual information within a sequence. Using contextual cues for image-based sequence recognition is more stable and helpful than treating each symbol independently. Taking scene text recognition as an example, wide characters may require several successive frames to fully describe (refer to Fig. 2). Besides, some ambiguous characters are easier to distinguish when observing their contexts, e.g. it is easier to recognize "il" by contrasting the character heights than by recognizing each of them separately. Secondly, RNN can back-propagate error differentials to its input, i.e. the convolutional layers, allowing us to jointly train the recurrent layers and the convolutional layers in a unified network. Thirdly, RNN is able to operate on sequences of arbitrary lengths, traversing from start to end.
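A minimal sketch of such stacked bidirectional recurrent layers is given below, assuming PyTorch; the hidden size (256) and the number of output classes (37) are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn

class BidirectionalLSTM(nn.Module):
    """One bidirectional LSTM layer followed by a linear projection,
    mapping each frame of the input sequence to an output vector."""
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.rnn = nn.LSTM(in_dim, hidden_dim, bidirectional=True)
        self.proj = nn.Linear(hidden_dim * 2, out_dim)

    def forward(self, x):                  # x: (T, N, in_dim)
        recurrent, _ = self.rnn(x)         # (T, N, 2 * hidden_dim)
        return self.proj(recurrent)        # (T, N, out_dim)

# Two stacked bidirectional layers, as in Fig. 3.b; sizes are illustrative.
num_classes = 37                           # e.g. 36 characters + 1 'blank'
recurrent_layers = nn.Sequential(
    BidirectionalLSTM(512, 256, 256),
    BidirectionalLSTM(256, 256, num_classes),
)

frames = torch.randn(26, 1, 512)           # feature sequence from the conv layers
logits = recurrent_layers(frames)          # (26, 1, 37): one distribution per frame
```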


Figure 3. (a) The structure of a basic LSTM unit. An LSTM consists of a cell module and three gates, namely the input gate, the output gate and the forget gate. (b) The structure of deep bidirectional LSTM we use in our paper. Combining a forward (left to right) and a backward (right to left) LSTMs results in a bidirectional LSTM. Stacking multiple bidirectional LSTM results in a deep bidirectional LSTM.



A traditional RNN unit has a self-connected hidden layer between its input and output layers. Each time it receives a frame x_t in the sequence, it updates its internal state h_t with a non-linear function that takes both the current input x_t and the past state h_{t-1} as its inputs: h_t = g(x_t, h_{t-1}). Then the prediction y_t is made based on h_t. In this way, past contexts {x_t'} for t' < t are captured and utilized for prediction. Traditional RNN units, however, suffer from the vanishing gradient problem, which limits the range of context they can store and adds burden to the training process. Long Short-Term Memory (LSTM) is a type of RNN unit that is specifically designed to address this problem. An LSTM (illustrated in Fig. 3.a) consists of a memory cell and three multiplicative gates, namely the input, output and forget gates. Conceptually, the memory cell stores past contexts, and the input and output gates allow the cell to store contexts for a long period of time; meanwhile, the memory in the cell can be cleared by the forget gate. This special design allows LSTM to capture long-range dependencies, which often occur in image-based sequences.
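The recurrence h_t = g(x_t, h_{t-1}) can be written out in a few lines; the tanh non-linearity and the randomly initialized weight matrices below are illustrative choices for g, not specifics from the paper.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One step of a vanilla RNN: h_t = g(x_t, h_{t-1})."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

rng = np.random.default_rng(0)
in_dim, hid_dim, T = 512, 256, 26                 # illustrative sizes
W_xh = rng.standard_normal((hid_dim, in_dim)) * 0.01
W_hh = rng.standard_normal((hid_dim, hid_dim)) * 0.01
b_h = np.zeros(hid_dim)

h = np.zeros(hid_dim)
for x_t in rng.standard_normal((T, in_dim)):      # the feature sequence x_1 .. x_T
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)         # h_t depends on all past frames
```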


LSTM is directional; it only uses past contexts. However, in image-based sequences, contexts from both directions are useful and complementary to each other. Therefore, we follow [17] and combine two LSTMs, one forward and one backward, into a bidirectional LSTM. Furthermore, multiple bidirectional LSTMs can be stacked, resulting in a deep bidirectional LSTM as illustrated in Fig. 3.b. The deep structure allows a higher level of abstraction than a shallow one, and has achieved significant performance improvements in the task of speech recognition [17].



In the recurrent layers, error differentials are propagated in the directions opposite to the arrows shown in Fig. 3.b, i.e. by Back-Propagation Through Time (BPTT). At the bottom of the recurrent layers, the sequence of propagated differentials is concatenated into maps, inverting the operation of converting feature maps into feature sequences, and fed back to the convolutional layers. In practice, we create a custom network layer, called "Map-to-Sequence", as the bridge between the convolutional layers and the recurrent layers.



2.3. Transcription

Transcription is the process of converting the per-frame predictions made by the RNN into a label sequence. Mathematically, transcription is to find the label sequence with the highest probability conditioned on the per-frame predictions. In practice, there exist two modes of transcription, namely lexicon-free and lexicon-based transcription. A lexicon is a set of label sequences that the prediction is constrained to, e.g. a spell-checking dictionary. In lexicon-free mode, predictions are made without any lexicon. In lexicon-based mode, predictions are made by choosing the label sequence in the lexicon that has the highest probability.



2.3.1 Probability of label sequence


We adopt the conditional probability defined in the Connectionist Temporal Classification (CTC) layer proposed by Graves et al. [15]. The probability is defined for label sequence l conditioned on the per-frame predictions y =y_1,...,y_T , and it ignores the position where each label in l is located. Consequently, when we use the negative log-likelihood of this probability as the objective to train the network, we only need images and their corresponding label sequences, avoiding the labor of labeling positions of individual characters.


The formulation of the conditional probability is briefly described as follows. The input is a sequence $y = y_1, \dots, y_T$, where $T$ is the sequence length. Here, each $y_t \in \mathbb{R}^{|L'|}$ is a probability distribution over the set $L' = L \cup \{\text{blank}\}$, where $L$ contains all labels in the task (e.g. all English characters) and the extra 'blank' label is denoted by '-'. A sequence-to-sequence mapping function $B$ is defined on sequences $\pi \in L'^T$, where $T$ is the length. $B$ maps $\pi$ onto $l$ by firstly removing the repeated labels, then removing the 'blank's. For example, $B$ maps "--hh-e-l-ll-oo--" ('-' represents 'blank') onto "hello". Then, the conditional probability is defined as the sum of probabilities of all $\pi$ that are mapped by $B$ onto $l$:


$$p(l \mid y) = \sum_{\pi : B(\pi) = l} p(\pi \mid y) \qquad (1)$$

where the probability of $\pi$ is defined as $p(\pi \mid y) = \prod_{t=1}^{T} y_{\pi_t}^{t}$, and $y_{\pi_t}^{t}$ is the probability of having label $\pi_t$ at time stamp $t$. Directly computing Eq. 1 would be computationally infeasible due to the exponentially large number of summation terms. However, Eq. 1 can be efficiently computed using the forward-backward algorithm described in [15].
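To make the mapping B and Eq. 1 concrete, the sketch below enumerates all paths by brute force; this is only feasible for tiny T and toy label sets and is shown purely for illustration, whereas in practice the forward-backward algorithm of [15] is used.

```python
import itertools
import numpy as np

BLANK = 0   # index of the 'blank' label in L'

def B(pi):
    """Collapse repeated labels, then remove blanks: B('--hh-e-l-ll-oo--') -> 'hello'."""
    collapsed = [k for k, _ in itertools.groupby(pi)]
    return tuple(k for k in collapsed if k != BLANK)

def p_l_given_y(l, y):
    """Eq. 1 by brute force: sum p(pi|y) over all paths pi with B(pi) = l."""
    T, num_labels = y.shape
    total = 0.0
    for pi in itertools.product(range(num_labels), repeat=T):
        if B(pi) == tuple(l):
            total += np.prod([y[t, pi[t]] for t in range(T)])
    return total

# Toy per-frame distributions y (T = 4 frames, |L'| = 3 labels: blank, 'a', 'b').
y = np.full((4, 3), 1.0 / 3.0)
print(p_l_given_y([1, 2], y))   # probability of the label sequence "ab"
```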


2.3.2 Lexicon-free transcription

In this mode, the sequence $l^*$ that has the highest probability as defined in Eq. 1 is taken as the prediction. Since there exists no tractable algorithm to precisely find the solution, we use the strategy adopted in [15]: the sequence $l^*$ is approximately found by $l^* \approx B(\arg\max_{\pi} p(\pi \mid y))$, i.e. taking the most probable label $\pi_t$ at each time stamp $t$, and mapping the resulting sequence onto $l^*$.
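A sketch of this best-path (greedy) decoding follows; the per-frame distributions are toy values made up for illustration.

```python
import itertools
import numpy as np

BLANK = 0

def greedy_decode(y):
    """Lexicon-free transcription: take the most probable label at each frame,
    then collapse repeats and remove blanks (l* ~ B(argmax_pi p(pi|y)))."""
    best_path = np.argmax(y, axis=1)                       # most probable label per frame
    collapsed = [k for k, _ in itertools.groupby(best_path)]
    return [int(k) for k in collapsed if k != BLANK]

# Toy per-frame distributions over {blank, 'h', 'e', 'l', 'o'} (indices 0..4).
y = np.array([
    [0.1, 0.8, 0.05, 0.025, 0.025],   # 'h'
    [0.1, 0.1, 0.70, 0.05, 0.05],     # 'e'
    [0.1, 0.1, 0.05, 0.70, 0.05],     # 'l'
    [0.8, 0.05, 0.05, 0.05, 0.05],    # blank separates the two 'l's
    [0.1, 0.1, 0.05, 0.70, 0.05],     # 'l'
    [0.1, 0.1, 0.05, 0.05, 0.70],     # 'o'
])
print(greedy_decode(y))               # [1, 2, 3, 3, 4]  -> "hello"
```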



2.3.3 Lexicon-based transcription

In lexicon-based mode, each test sample is associated with a lexicon $D$. Basically, the label sequence is recognized by choosing the sequence in the lexicon that has the highest conditional probability defined in Eq. 1, i.e. $l^* = \arg\max_{l \in D} p(l \mid y)$. However, for large lexicons, e.g. the 50k-word Hunspell spell-checking dictionary [1], it would be very time-consuming to perform an exhaustive search over the lexicon, i.e. to compute Eq. 1 for all sequences in the lexicon and choose the one with the highest probability. To solve this problem, we observe that the label sequences predicted via lexicon-free transcription, described in 2.3.2, are often close to the ground truth under the edit distance metric. This indicates that we can limit our search to the nearest-neighbor candidates $N_\delta(l')$, where $\delta$ is the maximal edit distance and $l'$ is the sequence transcribed from $y$ in lexicon-free mode:


$$l^* = \arg\max_{l \in N_\delta(l')} p(l \mid y). \qquad (2)$$

The candidates N_δ (l^')can be found efficiently with the BK-tree data structure [9], which is a metric tree specifically adapted to discrete metric spaces. The search time complexity of BK-tree is O(log |D|), where |D| is the lexicon size. Therefore this scheme readily extends to very large lexicons. In our approach, a BK-tree is constructed offline for a lexicon. Then we perform fast online search with the tree, by finding sequences that have less or equal to δ edit distance to the query sequence.
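A minimal BK-tree sketch is shown below; the Levenshtein routine, the tree layout and the toy lexicon are standard illustrative choices, not the paper's C++ implementation. The candidates returned by query would then be re-ranked by p(l|y) as in Eq. 2.

```python
def edit_distance(a, b):
    """Standard Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

class BKTree:
    """BK-tree over a discrete metric (here: edit distance)."""
    def __init__(self, words):
        it = iter(words)
        self.root = (next(it), {})                 # node = (word, {distance: child})
        for w in it:
            self._add(w)

    def _add(self, w):
        node = self.root
        while True:
            d = edit_distance(w, node[0])
            if d in node[1]:
                node = node[1][d]
            else:
                node[1][d] = (w, {})
                return

    def query(self, w, delta):
        """All lexicon entries within edit distance delta of w (the set N_delta)."""
        out, stack = [], [self.root]
        while stack:
            word, children = stack.pop()
            d = edit_distance(w, word)
            if d <= delta:
                out.append(word)
            # triangle inequality: only children whose key k satisfies |d - k| <= delta can match
            stack.extend(child for k, child in children.items()
                         if d - delta <= k <= d + delta)
        return out

lexicon = ["hello", "help", "hell", "shell", "yellow", "world"]   # toy lexicon
tree = BKTree(lexicon)
print(tree.query("helo", 2))   # candidate sequences, to be re-ranked by p(l|y)
```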



2.4. Network Training

Denote the training dataset by $X = \{I_i, l_i\}_i$, where $I_i$ is the training image and $l_i$ is the ground truth label sequence. The objective is to minimize the negative log-likelihood of the conditional probability of the ground truth:

$$\mathcal{O} = -\sum_{(I_i, l_i) \in X} \log p(l_i \mid y_i), \qquad (3)$$

where y_i is the sequence produced by the recurrent and convolutional layers from I_i . This objective function calculates a cost value directly from an image and its ground truth label sequence. Therefore, the network can be end-to-end trained on pairs of images and sequences, eliminating the procedure of manually labeling all individual components in training images.
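In a modern framework, this objective corresponds to a CTC loss over the per-frame predictions. The sketch below uses PyTorch's nn.CTCLoss as a stand-in for the paper's Torch7/C++ transcription layer; the shapes, label indices and sequence lengths are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Log-probabilities produced by the recurrent layers for a batch of 2 images:
# (T, N, |L'|) with T = 26 frames and 37 labels (index 0 reserved for 'blank').
log_probs = torch.randn(26, 2, 37, requires_grad=True).log_softmax(2)

# Ground truth label sequences l_i, concatenated, with their lengths.
targets = torch.tensor([8, 5, 12, 12, 15,    # "hello" encoded as label indices (a=1..z=26)
                        3, 1, 20])           # "cat"
target_lengths = torch.tensor([5, 3])
input_lengths = torch.full((2,), 26, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)                    # negative log-likelihood of Eq. 3, averaged
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                              # error differentials flow back through the network
```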



The network is trained with stochastic gradient descent (SGD). Gradients are calculated by the back-propagation algorithm. In particular, in the transcription layer, error differentials are back-propagated with the forward-backward algorithm, as described in [15]. In the recurrent layers, the Back-Propagation Through Time (BPTT) is applied to calculate the error differentials.



For optimization, we use ADADELTA [37] to automatically calculate per-dimension learning rates. Compared with the conventional momentum [31] method, ADADELTA requires no manual setting of a learning rate. More importantly, we find that optimization using ADADELTA converges faster than the momentum method.



3. Experiments

To evaluate the effectiveness of the proposed CRNN model, we conducted experiments on standard benchmarks for scene text recognition and musical score recognition, which are both challenging vision tasks. The datasets and the settings for training and testing are given in Sec. 3.1, the detailed settings of CRNN for scene text images are provided in Sec. 3.2, and the results with comprehensive comparisons are reported in Sec. 3.3. To further demonstrate the generality of CRNN, we verify the proposed algorithm on a music score recognition task in Sec. 3.4.



3.1. Datasets

For all the experiments on scene text recognition, we use the synthetic dataset (Synth) released by Jaderberg et al. [20] as the training data. The dataset contains 8 million training images and their corresponding ground truth words. Such images are generated by a synthetic text engine and are highly realistic. Our network is trained on the synthetic data once, and tested on all other real-world test datasets without any fine-tuning on their training data. Even though the CRNN model is purely trained with synthetic text data, it works well on real images from standard text recognition benchmarks.



Four popular benchmarks for scene text recognition are used for performance evaluation, namely ICDAR 2003 (IC03), ICDAR 2013 (IC13), IIIT 5k-word (IIIT5k), and Street View Text (SVT).



IC03 [27] test dataset contains 251 scene images with labeled text bounding boxes. Following Wang et al. [34], we ignore images that either contain non-alphanumeric characters or have fewer than three characters, and get a test set with 860 cropped text images. Each test image is associated with a 50-word lexicon defined by Wang et al. [34]. A full lexicon is built by combining all the per-image lexicons. In addition, we use a 50k-word lexicon consisting of the words in the Hunspell spell-checking dictionary [1].


Table 1. Network configuration summary. The first row is the top layer. 'k', 's' and 'p' stand for kernel size, stride and padding size respectively



IC13 [24] test dataset inherits most of its data from IC03. It contains 1,015 ground truths cropped word images.

IIIT5k [28] contains 3,000 cropped word test images collected from the Internet. Each image has been associated to a 50-words lexicon and a 1k-words lexicon.

SVT [34] test dataset consists of 249 street view images collected from Google Street View. From them 647 word images are cropped. Each word image has a 50 words lexicon defined by Wang et al. [34].



3.2. Implementation Details

The network configuration we use in our experiments is summarized in Table 1. The architecture of the convolutional layers is based on the VGG-VeryDeep architectures [32]. A tweak is made in order to make it suitable for recognizing English texts. In the 3rd and the 4th max-pooling layers, we adopt 1 × 2 sized rectangular pooling windows instead of the conventional square ones. This tweak yields feature maps with larger width, hence longer feature sequences. For example, an image containing 10 characters is typically of size 100 × 32, from which a feature sequence of 25 frames can be generated. This length exceeds the lengths of most English words. On top of that, the rectangular pooling windows yield rectangular receptive fields (illustrated in Fig. 2), which are beneficial for recognizing some characters that have narrow shapes, such as 'i' and 'l'.
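The effect of the rectangular window can be checked with a few lines of PyTorch (a sketch only; the exact strides and paddings of Table 1 are not reproduced, and the 1 × 2 window is assumed to mean width × height, i.e. the height is halved while the width is preserved):

```python
import torch
import torch.nn as nn

# Feature map after the first two conv + 2x2 pooling stages of a 32 (h) x 100 (w) input.
x = torch.randn(1, 256, 8, 25)                             # (N, C, H, W)

square = nn.MaxPool2d(kernel_size=2, stride=2)             # conventional square window
rect = nn.MaxPool2d(kernel_size=(2, 1), stride=(2, 1))     # the paper's 1x2 window, assumed w x h

print(square(x).shape)   # torch.Size([1, 256, 4, 12]) -- the width is halved as well
print(rect(x).shape)     # torch.Size([1, 256, 4, 25]) -- width kept, longer feature sequence
```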



The network not only has deep convolutional layers, but also has recurrent layers. Both are known to be hard to train. We find that the batch normalization [19] technique is extremely useful for training networks of such depth. Two batch normalization layers are inserted after the 5th and 6th convolutional layers respectively. With the batch normalization layers, the training process is greatly accelerated.



We implement the network within the Torch7 [10] framework, with custom implementations for the LSTM units (in Torch7/CUDA), the transcription layer (in C++) and the BK-tree data structure (in C++). Experiments are carried out on a workstation with a 2.50 GHz Intel(R) Xeon(R) E5-2609 CPU, 64GB RAM and an NVIDIA(R) Tesla(TM) K40 GPU. Networks are trained with ADADELTA, setting the parameter ρ to 0.9. During training, all images are scaled to 100 × 32 in order to accelerate the training process. The training process takes about 50 hours to reach convergence. Testing images are scaled to have height 32. Widths are proportionally scaled with heights, but at least 100 pixels. The average testing time is 0.16s/sample, as measured on IC03 without a lexicon. The approximate lexicon search is applied to the 50k lexicon of IC03, with the parameter δ set to 3. Testing each sample takes 0.53s on average.



3.3. Comparative Evaluation

All the recognition accuracies on the above four public datasets, obtained by the proposed CRNN model and the recent state-of-the-art techniques including the approaches based on deep models [23, 22, 21], are shown in Table 2.



In the constrained lexicon cases, our method consistently outperforms most state-of-the-art approaches, and on average beats the best text reader proposed in [22]. Specifically, we obtain superior performance on IIIT5k and SVT compared to [22], and only achieve lower performance on IC03 with the "Full" lexicon. Note that the model in [22] is trained on a specific dictionary, namely that each word is associated to a class label. Unlike [22], CRNN is not limited to recognizing a word in a known dictionary, and is able to handle random strings (e.g. telephone numbers), sentences or other scripts such as Chinese words. Therefore, the results of CRNN are competitive on all the testing datasets.



In the unconstrained lexicon cases, our method achieves the best performance on SVT, yet is still behind some approaches [8, 22] on IC03 and IC13. Note that the blanks in the "none" columns of Table 2 denote that such approaches are unable to be applied to recognition without a lexicon, or did not report the recognition accuracies in the unconstrained cases. Our method uses only synthetic text with word-level labels as the training data, very different from PhotoOCR [8], which used 7.9 million real word images with character-level annotations for training. The best performance in the unconstrained lexicon cases is reported by [22], benefiting from its large dictionary; however, it is not a model strictly unconstrained to a lexicon as mentioned before. In this sense, our results in the unconstrained lexicon case are still promising.



For further understanding the advantages of the proposed algorithm over other text recognition approaches, we provide a comprehensive comparison on several properties named E2E Train, Conv Ftrs, CharGT-Free, Unconstrained, and Model Size, as summarized in Table 3.


Table 3. Comparison among various methods. Attributes for comparison include: 1) being end-to-end trainable (E2E Train); 2) using convolutional features that are directly learned from images rather than using hand-crafted ones (Conv Ftrs); 3) requiring no ground truth bounding boxes for characters during training (CharGT-Free); 4) not confined to a pre-defined dictionary (Unconstrained); 5) the model size (if an end-to-end trainable model is used), measured by the number of model parameters (Model Size, M stands for millions).



E2E Train: This column is to show whether a certain text reading model is end-to-end trainable, without any preprocessing or several separated steps, which indicates such approaches are elegant and clean for training. As can be observed from Table 3, only the models based on deep neural networks, including [22, 21] as well as CRNN, have this property.



Conv Ftrs: This column is to indicate whether an approach uses the convolutional features learned from training images directly or hand-crafted features as the basic representations.



CharGT-Free: This column is to indicate whether the character-level annotations are essential for training the model. As the input and output labels of CRNN can be a sequence, character-level annotations are not necessary.



Unconstrained: This column is to indicate whether the trained model is constrained to a specific dictionary, unable to handle out-of-dictionary words or random sequences. Notice that though the recent models learned by label embedding [5, 14] and incremental learning [22] achieved highly competitive performance, they are constrained to a specific dictionary.


Table 2. Recognition accuracies (%) on four datasets. In the second row, "50", "1k", "50k" and "Full" denote the lexicon used, and "None" denotes recognition without a lexicon. (*[22] is not lexicon-free in the strict sense, as its outputs are constrained to a 90k dictionary.)



Model Size: This column is to report the storage space of the learned model. In CRNN, all layers have weight-sharing connections, and the fully-connected layers are not needed. Consequently, the number of parameters of CRNN is much less than that of the models learned on the variants of CNN [22, 21], resulting in a much smaller model compared with [22, 21]. Our model has 8.3 million parameters, taking only 33MB RAM (using a 4-byte single-precision float for each parameter), thus it can be easily ported to mobile devices.


Table 3 clearly shows the differences among different approaches in detail, and fully demonstrates the advantages of CRNN over other competing methods. In addition, to test the impact of the parameter δ, we experiment with different values of δ in Eq. 2. In Fig. 4 we plot the recognition accuracy as a function of δ. Larger δ results in more candidates, thus more accurate lexicon-based transcription. On the other hand, the computational cost grows with larger δ, due to longer BK-tree search time, as well as a larger number of candidate sequences for testing. In practice, we choose δ = 3 as a tradeoff between accuracy and speed.


Figure 4. Blue line graph: recognition accuracy as a function of the parameter δ. Red bars: lexicon search time per sample. Tested on the IC03 dataset with the 50k lexicon.



3.4. Musical Score Recognition

A musical score typically consists of sequences of musical notes arranged on staff lines. Recognizing musical scores in images is known as the Optical Music Recognition (OMR) problem. Previous methods often require image preprocessing (mostly binarization), staff line detection and individual note recognition [29]. We cast OMR as a sequence recognition problem, and predict a sequence of musical notes directly from the image with CRNN. For simplicity, we recognize pitches only, ignore all chords and assume the same major scale (C major) for all scores.



To the best of our knowledge, there exists no public dataset for evaluating algorithms on pitch recognition. To prepare the training data needed by CRNN, we collect 2650 images from [2]. Each image contains a fragment of a score containing 3 to 20 notes. We manually label the ground truth label sequences (sequences of note pitches) for all the images. The collected images are augmented to 265k training samples by being rotated, scaled and corrupted with noise, and by replacing their backgrounds with natural images. For testing, we create three datasets: 1) "Clean", which contains 260 images collected from [2]. Examples are shown in Fig. 5.a; 2) "Synthesized", which is created from "Clean", using the augmentation strategy mentioned above. It contains 200 samples, some of which are shown in Fig. 5.b; 3) "Real-World", which contains 200 images of score fragments taken from music books with a phone camera. Examples are shown in Fig. 5.c.


Figure 5. (a) Clean musical scores images collected from [2] (b) Synthesized musical score images. (c) Real-world score images taken with a mobile phone camera.



Since we have limited training data, we use a simplified CRNN configuration in order to reduce model capacity. Different from the configuration specified in Tab. 1, the 4th and 6th convolution layers are removed, and the 2-layer bidirectional LSTM is replaced by a 2-layer single directional LSTM. The network is trained on the pairs of images and corresponding label sequences. Two measures are used for evaluating the recognition performance: 1) fragment accuracy, i.e. the percentage of score fragments correctly recognized; 2) average edit distance, i.e. the average edit distance between predicted pitch sequences and the ground truths. For comparison, we evaluate two commercial OMR engines, namely the Capella Scan [3] and the PhotoScore [4].


Table 4. Comparison of pitch recognition accuracies, among CRNN and two commercial OMR systems, on the three datasets we have collected. Performances are evaluated by fragment accuracies and average edit distance ("fragment accuracy/average edit distance").



Tab. 4 summarizes the results. The CRNN outperforms the two commercial systems by a large margin. The Capella Scan and PhotoScore systems perform reasonably well on the Clean dataset, but their performances drop significantly on synthesized and real-world data. The main reason is that they rely on robust binarization to detect staff lines and notes, but the binarization step often fails on synthesized and real-world data due to bad lighting conditions, noise corruption and cluttered backgrounds. The CRNN, on the other hand, uses convolutional features that are highly robust to noise and distortions. Besides, the recurrent layers in CRNN can utilize contextual information in the score. Each note is recognized not only by itself, but also by the nearby notes. Consequently, some notes can be recognized by comparing them with the nearby notes, e.g. contrasting their vertical positions.



The results have shown the generality of CRNN, in that it can be readily applied to other image-based sequence recognition problems, requiring minimal domain knowledge. Compared with Capella Scan and PhotoScore, our CRNN-based system is still preliminary and misses many functionalities. But it provides a new scheme for OMR, and has shown promising capabilities in pitch recognition.



4. Conclusion

In this paper, we have presented a novel neural network architecture, called Convolutional Recurrent Neural Network (CRNN), which integrates the advantages of both Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN). CRNN is able to take input images of varying dimensions and produces predictions with different lengths. It directly runs on coarse level labels (e.g. words), requiring no detailed annotations for each individual element (e.g. characters) in the training phase. Moreover, as CRNN abandons fully connected layers used in conventional neural networks, it results in a much more compact and efficient model. All these properties make CRNN an excellent approach for image-based sequence recognition.



The experiments on the scene text recognition benchmarks demonstrate that CRNN achieves superior or highly competitive performance, compared with conventional methods as well as other CNN and RNN based algorithms. This confirms the advantages of the proposed algorithm. In addition, CRNN significantly outperforms other competitors on a benchmark for Optical Music Recognition (OMR), which verifies the generality of CRNN.



Actually, CRNN is a general framework, thus it can be applied to other domains and problems (such as Chinese character recognition), which involve sequence prediction in images. To further speed up CRNN and make it more practical in real-world applications is another direction that is worthy of exploration in the future.


Original paper: An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition (arXiv:1507.05717)

"


分享到:


相關文章: