翻译:端到端的神经网络图像序列识别及其在场景文本识别中的应用

An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition

一种用于基于图像的序列识别的端到端可训练神经网络及其在场景文本识别中的应用

Abstract


Image-based sequence recognition has been a longstanding research topic in computer vision. In this paper, we investigate the problem of scene text recognition, which is among the most important and challenging tasks in image-based sequence recognition. A novel neural network architecture, which integrates feature extraction, sequence modeling and transcription into a unified framework, is proposed. Compared with previous systems for scene text recognition, the proposed architecture possesses four distinctive properties: (1) It is end-to-end trainable, in contrast to most of the existing algorithms whose components are separately trained and tuned. (2) It naturally handles sequences in arbitrary lengths, involving no character segmentation or horizontal scale normalization. (3) It is not confined to any predefined lexicon and achieves remarkable performances in both lexicon-free and lexicon-based scene text recognition tasks. (4) It generates an effective yet much smaller model, which is more practical for real-world application scenarios. The experiments on standard benchmarks, including the IIIT-5K, Street View Text and ICDAR datasets, demonstrate the superiority of the proposed algorithm over the prior arts. Moreover, the proposed algorithm performs well in the task of image-based music score recognition, which evidently verifies the generality of it.

基于图像的序列识别已成为计算机视觉领域的长期研究课题。在本文中,我们研究了场景文本识别问题,这是基于图像的序列识别中最重要和最具挑战性的任务之一。提出了一种新颖的神经网络架构,它将特征提取,序列建模和转录集成到一个统一的框架中。与以前的用于场景文本识别的系统相比,所提出的体系结构具有四个独特的特性:(1)与大多数现有的算法(其组件分别经过训练和调整)相比,它是端对端可训练的。 (2)它自然地处理任意长度的序列,不涉及字符分割或水平尺度归一化。 (3)它不限于任何预定义的词典,并且在无词典和基于词典的场景文本识别任务中均表现出色。 (4)生成有效但小得多的模型,这对于实际应用场景更实用。在包括IIIT-5K,街景文字和ICDAR数据集在内的标准基准上进行的实验证明了该算法优于现有技术的优势。此外,该算法在基于图像的乐谱识别任务中表现良好,显然证明了其通用性。


1. Introduction

Recently, the community has seen a strong revival of neural networks, which is mainly stimulated by the great success of deep neural network models, specifically Deep Convolutional Neural Networks (DCNN), in various vision tasks. However, the majority of the recent works related to deep neural networks have been devoted to detection or classification of object categories [12, 25]. In this paper, we are concerned with a classic problem in computer vision: image-based sequence recognition. In the real world, a variety of visual objects, such as scene text, handwriting and musical scores, tend to occur in the form of sequences, not in isolation. Unlike general object recognition, recognizing such sequence-like objects often requires the system to predict a series of object labels, instead of a single label. Therefore, recognition of such objects can be naturally cast as a sequence recognition problem. Another unique property of sequence-like objects is that their lengths may vary drastically. For instance, English words can consist of either 2 characters such as "OK" or 15 characters such as "congratulations". Consequently, the most popular deep models like DCNN [25, 26] cannot be directly applied to sequence prediction, since DCNN models often operate on inputs and outputs with fixed dimensions, and thus are incapable of producing a variable-length label sequence.


最近,社区看到了神经网络的强大复兴,这主要是由于深度神经网络模型(尤其是深度卷积神经网络(DCNN))在各种视觉任务中的巨大成功所激发。但是,与深度神经网络有关的最新著作大多数都致力于对象类别的检测或分类[12,25]。在本文中,我们关注计算机视觉中的一个经典问题:基于图像的序列识别。在现实世界中,许多视觉对象(例如场景文本、手写和乐谱)倾向于以序列而不是孤立的形式出现。与一般对象识别不同,识别此类类似序列的对象通常需要系统预测一系列对象标签,而不是单个标签。因此,这种对象的识别自然可以被看作是序列识别问题。类序列对象的另一个独特属性是它们的长度可能会急剧变化。例如,英语单词可以仅由2个字符组成(如"OK"),也可以由15个字符组成(如"congratulations")。因此,像DCNN [25,26]这样最流行的深度模型不能直接应用于序列预测,因为DCNN模型通常对具有固定尺寸的输入和输出进行操作,因此无法生成可变长度的标签序列。


Some attempts have been made to address this problem for a specific sequence-like object (e.g. scene text). For example, the algorithms in [35, 8] firstly detect individual characters and then recognize these detected characters with DCNN models, which are trained using labeled character images. Such methods often require training a strong character detector for accurately detecting and cropping each character out from the original word image. Some other approaches (such as [22]) treat scene text recognition as an image classification problem, and assign a class label to each English word (90K words in total). This results in a large trained model with a huge number of classes, which is difficult to generalize to other types of sequence-like objects, such as Chinese texts, musical scores, etc., because the number of basic combinations of such sequences can be greater than 1 million. In summary, current systems based on DCNN cannot be directly used for image-based sequence recognition.


对于特定的类似序列的对象(例如场景文本),已经尝试解决该问题。例如,[35,8]中的算法首先检测单个字符,然后使用DCNN模型识别这些检测到的字符,该模型使用标记的字符图像进行训练。此类方法通常需要训练强大的字符检测器,以准确地从原始文字图像中检测并裁剪出每个字符。其他一些方法(例如[22])将场景文本识别视为图像分类问题,并为每个英语单词(总共90K个单词)分配一个类别标签。事实证明,这种训练有素的模型具有大量的类,很难将其推广到其他类型的类似序列的对象,例如中文文本,乐谱等,因为此类序列的基本组合数量可以大于一百万。总之,当前基于DCNN的系统不能直接用于基于图像的序列识别。


Recurrent neural network (RNN) models, another important branch of the deep neural network family, were mainly designed for handling sequences. One of the advantages of RNN is that it does not need the position of each element in a sequence object image in either training or testing. However, a preprocessing step that converts an input object image into a sequence of image features is usually essential. For example, Graves et al. [16] extract a set of geometrical or image features from handwritten texts, while Su and Lu [33] convert word images into sequential HOG features. The preprocessing step is independent of the subsequent components in the pipeline, thus the existing systems based on RNN cannot be trained and optimized in an end-to-end fashion.


递归神经网络(RNN)模型是深度神经网络家族的另一个重要分支,主要设计用于处理序列。 RNN的优点之一是,在训练和测试中,RNN都不需要序列对象图像中每个元素的位置。 但是,通常必须执行将输入对象图像转换为图像特征序列的预处理步骤。 例如,Graves等。 [16]从手写文本中提取出一组几何或图像特征,而Su和Lu [33]将单词图像转换为连续的HOG特征。 预处理步骤独立于流水线中的后续组件,因此无法以端到端的方式训练和优化基于RNN的现有系统。


Several conventional scene text recognition methods that are not based on neural networks have also brought insightful ideas and novel representations into this field. For example, Almazán et al. [5] and Rodriguez-Serrano et al. [30] proposed to embed word images and text strings in a common vectorial subspace, so that word recognition is converted into a retrieval problem. Yao et al. [36] and Gordo et al. [14] used mid-level features for scene text recognition. Though they achieved promising performance on standard benchmarks, these methods are generally outperformed by previous algorithms based on neural networks [8, 22], as well as the approach proposed in this paper.


几种不基于神经网络的传统场景文本识别方法也为该领域带来了有见地的想法和新颖的表示形式。例如,Almazán等人[5]和Rodriguez-Serrano等人[30]提出将单词图像和文本字符串嵌入到一个公共的向量子空间中,从而将单词识别转换为检索问题。Yao等人[36]和Gordo等人[14]使用中级特征进行场景文本识别。尽管这些方法在标准基准上取得了不错的性能,但它们通常不如之前基于神经网络的算法[8,22]以及本文提出的方法。


The main contribution of this paper is a novel neural network model, whose network architecture is specifically designed for recognizing sequence-like objects in images. The proposed neural network model is named Convolutional Recurrent Neural Network (CRNN), since it is a combination of DCNN and RNN. For sequence-like objects, CRNN possesses several distinctive advantages over conventional neural network models: 1) It can be directly learned from sequence labels (for instance, words), requiring no detailed annotations (for instance, characters); 2) It has the same property as DCNN of learning informative representations directly from image data, requiring neither hand-crafted features nor preprocessing steps such as binarization/segmentation, component localization, etc.; 3) It has the same property as RNN, being able to produce a sequence of labels; 4) It is unconstrained by the lengths of sequence-like objects, requiring only height normalization in both training and testing phases; 5) It achieves better or highly competitive performance on scene texts (word recognition) than the prior arts [23, 8]; 6) It contains far fewer parameters than a standard DCNN model, consuming less storage space.


本文的主要贡献是一种新颖的神经网络模型,该网络模型是专门为识别图像中类似序列的对象而设计的。所提出的神经网络模型是DCNN和RNN的组合,因此被称为卷积递归神经网络(CRNN)。对于类似序列的对象,CRNN与传统的神经网络模型相比具有几个明显的优势:1)可以直接从序列标签(例如单词)中学习,不需要详细的注释(例如字符); 2)它具有直接从图像数据中学习信息表示的DCNN的特性,既不需要手工功能也不需要预处理步骤,包括二值化/分割,组件定位等; 3)具有RNN的相同属性,能够产生一系列标签; 4)它不受序列状物体长度的限制,在训练和测试阶段都只需要高度标准化即可; 5)与现有技术相比,它在场景文本(单词识别)上表现出更好或极具竞争力的表现[23,8]; 6)它包含的参数比标准DCNN模型少得多,占用的存储空间也更少。


2. The Proposed Network Architecture

The network architecture of CRNN, as shown in Fig. 1, consists of three components, including the convolutional layers, the recurrent layers, and a transcription layer, from bottom to top.

如图1所示,CRNN的网络架构从下到上由三个部分组成,包括卷积层,循环层和转录层。


At the bottom of CRNN, the convolutional layers automatically extract a feature sequence from each input image. On top of the convolutional network, a recurrent network is built for making a prediction for each frame of the feature sequence outputted by the convolutional layers. The transcription layer at the top of CRNN is adopted to translate the per-frame predictions made by the recurrent layers into a label sequence. Though CRNN is composed of different kinds of network architectures (e.g. CNN and RNN), it can be jointly trained with one loss function.


在CRNN的底部,卷积层会自动从每个输入图像中提取特征序列。 在卷积网络之上,构建了一个递归网络,用于对由卷积层输出的特征序列的每一帧进行预测。 采用CRNN顶部的转录层,将循环层的每帧预测转换为标记序列。 尽管CRNN由不同类型的网络体系结构(例如CNN和RNN)组成,但可以使用一个损失函数进行联合训练。


Figure 1. The network architecture. The architecture consists of three parts: 1) convolutional layers, which extract a feature sequence from the input image; 2) recurrent layers, which predict a label distribution for each frame; 3) transcription layer, which translates the per-frame predictions into the final label sequence.

图1.网络架构。 该体系结构包括三个部分:1)卷积层,从输入图像中提取特征序列; 2)循环层,预测每个帧的标签分布; 3)转录层,它将每帧的预测翻译成最终的标记序列。


2.1. Feature Sequence Extraction

In the CRNN model, the component of convolutional layers is constructed by taking the convolutional and max-pooling layers from a standard CNN model (fully-connected layers are removed). This component is used to extract a sequential feature representation from an input image. Before being fed into the network, all the images need to be scaled to the same height. Then a sequence of feature vectors is extracted from the feature maps produced by the component of convolutional layers, which is the input to the recurrent layers. Specifically, each feature vector of the feature sequence is generated from left to right on the feature maps by column. This means the i-th feature vector is the concatenation of the i-th columns of all the maps. The width of each column in our settings is fixed to a single pixel.

在CRNN模型中,卷积层的组件是通过从标准CNN模型中取出卷积层和最大池化层(去掉全连接层)而构造的。这样的组件用于从输入图像中提取顺序特征表示。在送入网络之前,所有图像都需要缩放到相同的高度。然后,从卷积层组件产生的特征图中提取特征向量序列,作为循环层的输入。具体地,特征序列的每个特征向量在特征图上按列从左到右生成。这意味着第i个特征向量是所有特征图第i列的串联。我们设置中每列的宽度固定为单个像素。
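
To make the column-wise conversion concrete, the sketch below (in PyTorch rather than the authors' Torch7 code; the tensor sizes are made-up examples) turns a convolutional feature map into a sequence of per-column feature vectors:

```python
import torch

# Suppose the convolutional layers output a feature map of shape
# (batch, channels, height, width); the sizes here are assumptions for illustration.
feature_map = torch.randn(4, 512, 1, 25)  # e.g. height collapsed to 1, width 25

# Each column (a fixed width of one pixel) becomes one frame of the sequence:
# the i-th feature vector is the concatenation of the i-th columns of all maps.
b, c, h, w = feature_map.size()
sequence = feature_map.permute(3, 0, 1, 2).reshape(w, b, c * h)  # (T=w, batch, c*h)

print(sequence.shape)  # torch.Size([25, 4, 512])
```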


As the layers of convolution, max-pooling, and element-wise activation functions operate on local regions, they are translation invariant. Therefore, each column of the feature maps corresponds to a rectangular region of the original image (termed the receptive field), and such rectangular regions are in the same order as their corresponding columns on the feature maps from left to right. As illustrated in Fig. 2, each vector in the feature sequence is associated with a receptive field, and can be considered as the image descriptor for that region.

当卷积层,最大池化层和元素激活函数在局部区域上运行时,它们是平移不变的。 因此,特征图的每一列对应于原始图像的一个矩形区域(称为接收场),并且这些矩形区域从左到右与它们在特征图上相应列的顺序相同。 如图2所示,特征序列中的每个向量都与一个接收场相关联,并且可以被视为该区域的图像描述符。



Figure 2. The receptive field. Each vector in the extracted feature sequence is associated with a receptive field on the input image, and can be considered as the feature vector of that field.

图2.接收场。 提取的特征序列中的每个向量都与输入图像上的一个接收场相关联,并且可以视为该场的特征向量。


Being robust, rich and trainable, deep convolutional features have been widely adopted for different kinds of visual recognition tasks [25, 12]. Some previous approaches have employed CNN to learn a robust representation for sequence-like objects such as scene text [22]. However, these approaches usually extract a holistic representation of the whole image with CNN, then local deep features are collected for recognizing each component of a sequence-like object. Since CNN requires the input images to be scaled to a fixed size in order to conform to its fixed input dimension, it is not appropriate for sequence-like objects due to their large length variation. In CRNN, we convey deep features into sequential representations in order to be invariant to the length variation of sequence-like objects.

作为强大,丰富和可训练的深度卷积特征已被广泛用于各种视觉识别任务[25,12]。 某些先前的方法已经使用CNN来学习对诸如场景文本之类的序列对象的鲁棒表示[22]。 然而,这些方法通常通过CNN提取整个图像的整体表示,然后收集局部深层特征以识别序列状对象的每个组成部分。 由于CNN要求将输入图像缩放到固定大小,以满足其固定的输入尺寸,因此,由于序列长度较大,因此不适合用于类似序列的对象。 在CRNN中,我们将深层特征传达到顺序表示中,以便不变于序列状对象的长度变化。


2.2. Sequence Labeling

A deep bidirectional Recurrent Neural Network is built on top of the convolutional layers, as the recurrent layers. The recurrent layers predict a label distribution y_t for each frame x_t in the feature sequence x = x_1, ..., x_T. The advantages of the recurrent layers are three-fold. Firstly, RNN has a strong capability of capturing contextual information within a sequence. Using contextual cues for image-based sequence recognition is more stable and helpful than treating each symbol independently. Taking scene text recognition as an example, wide characters may require several successive frames to fully describe (refer to Fig. 2). Besides, some ambiguous characters are easier to distinguish when observing their contexts, e.g. it is easier to recognize "il" by contrasting the character heights than by recognizing each of them separately. Secondly, RNN can back-propagate error differentials to its input, i.e. the convolutional layers, allowing us to jointly train the recurrent layers and the convolutional layers in a unified network. Thirdly, RNN is able to operate on sequences of arbitrary lengths, traversing from start to end.

一个深层的双向递归神经网络被构建在卷积层的顶部,作为递归层。循环层针对特征序列x=x_1,…,x_T中的每个帧x_t预测标签分布y_t。循环层的优点是三方面的。首先,RNN具有在序列中捕获上下文信息的强大功能。与单独处理每个符号相比,使用上下文提示进行基于图像的序列识别更加稳定和有用。以场景文本识别为例,宽字符可能需要几个连续的帧才能完整描述(请参阅图2)。此外,某些模棱两可的字符在观察其上下文时更容易区分,例如通过对比字符高度来识别“ il”要比分别识别每个字符要容易。其次,RNN可以将误差差分反向传播到其输入即卷积层,从而使我们能够在统一网络中共同训练递归层和卷积层. 第三,RNN可以对任意长度的序列进行操作,从开始到结束。



Figure 3. (a) The structure of a basic LSTM unit. An LSTM consists of a cell module and three gates, namely the input gate, the output gate and the forget gate. (b) The structure of deep bidirectional LSTM we use in our paper. Combining a forward (left to right) and a backward (right to left) LSTMs results in a bidirectional LSTM. Stacking multiple bidirectional LSTM results in a deep bidirectional LSTM.

图3.(a)LSTM基本单元的结构。 LSTM由单元模块和三个门组成,即输入门,输出门和忘记门。 (b)我们在本文中使用的深度双向LSTM的结构。 将向前(从左到右)和向后(从右到左)LSTM组合在一起将产生双向LSTM。 堆叠多个双向LSTM会导致深度双向LSTM。


A traditional RNN unit has a self-connected hidden layer between its input and output layers. Each time it receives a frame x_t in the sequence, it updates its internal state h_t with a non-linear function that takes both the current input x_t and the past state h_{t-1} as its inputs: h_t = g(x_t, h_{t-1}). Then the prediction y_t is made based on h_t. In this way, past contexts {x_{t'}}_{t'<t} are captured and utilized for prediction. Traditional RNN units, however, suffer from the vanishing gradient problem, which limits the range of context they can store and adds burden to the training process. Long Short-Term Memory (LSTM) is a type of RNN unit specially designed to address this problem; as illustrated in Fig. 3.a, an LSTM consists of a memory cell and three multiplicative gates, namely the input, output and forget gates, which allow it to capture the long-range dependencies that often occur in image-based sequences.

传统的RNN单元在其输入层和输出层之间具有一个自连接的隐藏层。每次收到序列中的帧x_t时,它都会用一个非线性函数更新其内部状态h_t,该函数将当前输入x_t和过去状态h_{t-1}都作为输入:h_t = g(x_t, h_{t-1})。然后,基于h_t做出预测y_t。通过这种方式,过去的上下文{x_{t'}}_{t'<t}被捕获并用于预测。然而,传统的RNN单元存在梯度消失问题,这限制了它能存储的上下文范围,并给训练过程增加了负担。长短期记忆(LSTM)是一种专门为解决此问题而设计的RNN单元;如图3.a所示,LSTM由一个记忆单元和三个乘法门(即输入门、输出门和遗忘门)组成,使其能够捕获基于图像的序列中经常出现的长距离依赖关系。
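
As a minimal illustration of the recurrence above, the following sketch implements one update step h_t = g(x_t, h_{t-1}); instantiating g as a tanh of a linear combination is an assumption (a common choice), not something the paper fixes:

```python
import torch

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    # One step of a traditional RNN unit: h_t = g(x_t, h_{t-1}),
    # here assumed to be tanh of a linear combination of input and past state.
    return torch.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

# Unrolling over a feature sequence x_1, ..., x_T accumulates past contexts in h_t.
T, d_in, d_h = 25, 512, 256
W_xh, W_hh, b_h = torch.randn(d_in, d_h), torch.randn(d_h, d_h), torch.zeros(d_h)
h = torch.zeros(d_h)
for x_t in torch.randn(T, d_in):
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
```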

LSTM is directional: it only uses past contexts. However, in image-based sequences, contexts from both directions are useful and complementary to each other. Therefore, we follow [17] and combine two LSTMs, one forward and one backward, into a bidirectional LSTM. Furthermore, multiple bidirectional LSTMs can be stacked, resulting in a deep bidirectional LSTM as illustrated in Fig. 3.b. The deep structure allows a higher level of abstraction than a shallow one, and has achieved significant performance improvements in the task of speech recognition [17].

LSTM是定向的,它仅使用过去的上下文。 但是,在基于图像的序列中,来自两个方向的上下文都是有用的并且彼此互补。 因此,我们遵循[17],将两个LSTM(一个向前和一个向后)组合成双向LSTM。 此外,可以堆叠多个双向LSTM,从而产生如图3.b所示的深层双向LSTM。 较之较浅的结构,较深的结构可以实现更高级别的抽象,并且在语音识别任务中已经实现了显着的性能提升[17]。
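
A minimal sketch of such a stacked bidirectional LSTM, using PyTorch's nn.LSTM for brevity (the hidden size of 256 and the label-set size are assumptions for illustration, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

# Two stacked bidirectional LSTM layers followed by a per-frame linear projection
# onto the label set L'; sizes below are illustrative assumptions.
num_classes = 37          # e.g. 26 letters + 10 digits + 'blank' (assumption)
rnn = nn.LSTM(input_size=512, hidden_size=256, num_layers=2, bidirectional=True)
proj = nn.Linear(2 * 256, num_classes)

seq = torch.randn(25, 4, 512)        # (T, batch, feature) from the conv layers
h, _ = rnn(seq)                      # (T, batch, 2 * 256)
y = proj(h)                          # per-frame label distributions (logits)
print(y.shape)                       # torch.Size([25, 4, 37])
```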


In the recurrent layers, error differentials are propagated in the opposite directions of the arrows shown in Fig. 3.b, i.e. Back-Propagation Through Time (BPTT). At the bottom of the recurrent layers, the sequence of propagated differentials is concatenated into maps, inverting the operation of converting feature maps into feature sequences, and fed back to the convolutional layers. In practice, we create a custom network layer, called "Map-to-Sequence", as the bridge between the convolutional layers and the recurrent layers.

在循环层中,误差差沿图3.b所示箭头的相反方向传播,即反向传播时间(BPTT)。 在循环层的底部,将传播的差异序列连接成图,将将特征图转换为特征序列的操作反转,然后反馈到卷积层。 实际上,我们创建了一个自定义网络层,称为"映射到序列",作为卷积层和循环层之间的桥梁。


2.3. Transcription

Transcription is the process of converting the per-frame predictions made by the RNN into a label sequence. Mathematically, transcription is to find the label sequence with the highest probability conditioned on the per-frame predictions. In practice, there exist two modes of transcription, namely the lexicon-free and lexicon-based transcriptions. A lexicon is a set of label sequences that the prediction is constrained to, e.g. a spell-checking dictionary. In lexicon-free mode, predictions are made without any lexicon. In lexicon-based mode, predictions are made by choosing the label sequence that has the highest probability.

转录是将RNN进行的每帧预测转换为标签序列的过程。 在数学上,转录是要根据每帧预测找到具有最高概率的标记序列。 实际上,存在两种转录方式,即无词典和基于词典的转录。 词典是预测受其约束的一组标签序列,例如 拼写检查字典。 在无词典模式下,无需任何词典即可进行预测。 在基于词典的模式下,通过选择概率最高的标签序列来进行预测。


2.3.1 Probability of label sequence 标签序列的概率


We adopt the conditional probability defined in the Connectionist Temporal Classification (CTC) layer proposed by Graves et al. [15]. The probability is defined for label sequence l conditioned on the per-frame predictions y =y_1,...,y_T , and it ignores the position where each label in l is located. Consequently, when we use the negative log-likelihood of this probability as the objective to train the network, we only need images and their corresponding label sequences, avoiding the labor of labeling positions of individual characters.

我们采用Graves等人提出的在连接主义时间分类(CTC)层中定义的条件概率。 [15]。 该概率是针对以每帧预测 y =y_1,...,y_T为条件的标签序列l定义的,它忽略了l中每个标签所处的位置。 因此,当我们以这种可能性的负对数似然度为目标来训练网络时,我们只需要图像及其相应的标签序列,从而避免了为各个字符标注位置的麻烦。

The formulation of the conditional probability is briefly described as follows: The input is a sequence y = y_1, ..., y_T, where T is the sequence length. Here, each y_t ∈ R^{|L'|} is a probability distribution over the set L' = L ∪ {blank}, where L contains all labels in the task (e.g. all English characters) and the 'blank' label is denoted by '-'. A sequence-to-sequence mapping function B is defined on sequences π ∈ L'^T, where T is the length. B maps π onto l by firstly removing the repeated labels, then removing the 'blank's. For example, B maps "--hh-e-l-ll-oo--" ('-' represents 'blank') onto "hello". Then, the conditional probability is defined as the sum of probabilities of all π that are mapped by B onto l:

条件概率的公式简要描述如下:输入是序列 y = y_1, ..., y_T,其中T是序列长度。这里,每个 y_t ∈ R^{|L'|} 都是集合 L' = L ∪ {blank} 上的概率分布,其中L包含任务中的所有标签(例如所有英文字符),"空白"标签用'-'表示。在序列 π ∈ L'^T 上定义了序列到序列的映射函数B,其中T是长度。B首先删除重复的标签,然后删除"空白",从而将π映射到l上。例如,B将"--hh-e-l-ll-oo--"('-'代表"空白")映射到"hello"。然后,条件概率定义为被B映射到l上的所有π的概率之和:

p(l|y) = \sum_{\pi : B(\pi) = l} p(\pi|y)        (1)

where the probability of π is defined as p(π|y) = \prod_{t=1}^{T} y^t_{π_t}, and y^t_{π_t} is the probability of having label π_t at time stamp t. Directly computing Eq. 1 would be computationally infeasible due to the exponentially large number of summation items. However, Eq. 1 can be efficiently computed using the forward-backward algorithm described in [15].

其中π的概率定义为 p(π|y) = \prod_{t=1}^{T} y^t_{π_t},y^t_{π_t} 是在时间戳t处具有标签π_t的概率。由于求和项的数量呈指数级增长,直接计算式(1)在计算上是不可行的。但是,使用[15]中描述的前向-后向算法可以有效地计算式(1)。
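
The mapping B and a brute-force evaluation of Eq. 1 can be sketched as follows (pure Python; the brute-force sum is only feasible for tiny T, and in practice the forward-backward algorithm of [15] is what is actually used; the toy alphabet and probabilities are assumptions):

```python
import itertools

def B(pi, blank='-'):
    """The mapping B: first collapse repeated labels, then remove blanks."""
    collapsed = [k for k, _ in itertools.groupby(pi)]
    return ''.join(k for k in collapsed if k != blank)

assert B('--hh-e-l-ll-oo--') == 'hello'

def p_label_sequence(l, y, alphabet='-ab'):
    """Brute-force Eq. 1: sum of p(pi|y) over all pi with B(pi) = l.
    y[t][c] is the per-frame probability of label c at time stamp t."""
    T = len(y)
    total = 0.0
    for pi in itertools.product(alphabet, repeat=T):
        if B(pi) == l:
            prob = 1.0
            for t, c in enumerate(pi):
                prob *= y[t][c]
            total += prob
    return total

# Toy check with T = 2 frames over L' = {'-', 'a', 'b'}:
y = [{'-': 0.2, 'a': 0.7, 'b': 0.1}, {'-': 0.6, 'a': 0.3, 'b': 0.1}]
print(p_label_sequence('a', y))  # sums the paths 'a-', '-a' and 'aa'
```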

2.3.2 Lexicon-free transcription 无词典的转录

In this mode, the sequence l* that has the highest probability as defined in Eq. 1 is taken as the prediction. Since there exists no tractable algorithm to precisely find the solution, we use the strategy adopted in [15]. The sequence l* is approximately found by l* ≈ B(arg max_π p(π|y)), i.e. taking the most probable label π_t at each time stamp t, and mapping the resulting sequence onto l*.

在这种模式下,将具有式(1)中定义的最高概率的序列l*作为预测。由于不存在能够精确找到该解的可行算法,因此我们采用[15]中的策略。序列l*由 l* ≈ B(arg max_π p(π|y)) 近似得到,即在每个时间戳t处取最可能的标签π_t,并将得到的序列映射到l*上。
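
A sketch of this best-path decoding, reusing the B function from the previous sketch (the label ordering, with index 0 as 'blank', is an assumption for illustration):

```python
def lexicon_free_decode(y, labels='-abcdefghijklmnopqrstuvwxyz'):
    """Best-path decoding: take the most probable label at each time stamp t,
    then apply the mapping B (defined in the sketch above) to obtain l*.
    `y` is a list of per-frame score lists over L'; labels[0] is the blank."""
    pi = [labels[max(range(len(frame)), key=lambda i: frame[i])] for frame in y]
    return B(pi)
```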


2.3.3 Lexicon-based transcription 基于词典的转录

In lexicon-based mode, each test sample is associated with a lexicon D. Basically, the label sequence is recognized by choosing the sequence in the lexicon that has the highest conditional probability defined in Eq. 1, i.e. l* = arg max_{l∈D} p(l|y). However, for large lexicons, e.g. the 50k-word Hunspell spell-checking dictionary [1], it would be very time-consuming to perform an exhaustive search over the lexicon, i.e. to compute Equation 1 for all sequences in the lexicon and choose the one with the highest probability. To solve this problem, we observe that the label sequences predicted via lexicon-free transcription, described in 2.3.2, are often close to the ground-truth under the edit distance metric. This indicates that we can limit our search to the nearest-neighbor candidates N_δ(l'), where δ is the maximal edit distance and l' is the sequence transcribed from y in lexicon-free mode:

在基于词典的模式下,每个测试样本都与一个词典D相关联。基本上,通过选择词典中具有式(1)所定义的最高条件概率的序列来识别标签序列,即 l* = arg max_{l∈D} p(l|y)。但是,对于大型词典,例如含5万个单词的Hunspell拼写检查字典[1],要在词典上进行穷举搜索,即为词典中的所有序列计算式(1)并选择概率最高的序列,将非常耗时。为了解决这个问题,我们观察到,通过2.3.2中描述的无词典转录预测的标签序列,在编辑距离度量下通常接近真实值。这表明我们可以将搜索范围限制为最近邻候选N_δ(l'),其中δ是最大编辑距离,l'是在无词典模式下从y转录的序列:

l* = \arg\max_{l ∈ N_δ(l')} p(l|y)        (2)

The candidates N_δ(l') can be found efficiently with the BK-tree data structure [9], which is a metric tree specifically adapted to discrete metric spaces. The search time complexity of the BK-tree is O(log |D|), where |D| is the lexicon size. Therefore this scheme readily extends to very large lexicons. In our approach, a BK-tree is constructed offline for a lexicon. Then we perform fast online search with the tree, by finding sequences that have an edit distance less than or equal to δ to the query sequence.

可以使用BK树数据结构[9]有效地找到候选N_δ (l^'),BK树数据结构是专门适合于离散度量空间的度量树。 BK树的搜索时间复杂度为O(log |D|),其中|D|是词典大小。 因此,该方案很容易扩展到非常大的词典。 在我们的方法中,为词典离线构建BK树。 然后,通过查找与查询序列具有小于或等于δ编辑距离的序列,我们对树进行快速在线搜索。
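
A minimal BK-tree sketch for this candidate search (the lexicon words in the usage example are made up for illustration; a production implementation would be built offline over the full lexicon as described above):

```python
def edit_distance(a, b):
    """Levenshtein distance between two label sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

class BKTree:
    """A BK-tree over the lexicon for retrieving all words within edit distance delta."""
    def __init__(self, words):
        it = iter(words)
        self.root = (next(it), {})            # node = (word, {edge distance: child})
        for w in it:
            self._add(w)

    def _add(self, w):
        node = self.root
        while True:
            d = edit_distance(w, node[0])
            if d == 0:
                return
            if d in node[1]:
                node = node[1][d]
            else:
                node[1][d] = (w, {})
                return

    def query(self, w, delta):
        out, stack = [], [self.root]
        while stack:
            word, children = stack.pop()
            d = edit_distance(w, word)
            if d <= delta:
                out.append(word)
            # triangle inequality: only descend into children whose edge distance
            # lies in [d - delta, d + delta]
            stack.extend(child for dist, child in children.items()
                         if d - delta <= dist <= d + delta)
        return out

# Hypothetical usage: restrict the search to candidates near the lexicon-free result l'.
tree = BKTree(['hello', 'hallo', 'help', 'world'])
print(tree.query('helo', delta=3))
```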


2.4. Network Training

Denote the training dataset by X = {I_i, l_i}_i, where I_i is the training image and l_i is the ground truth label sequence. The objective is to minimize the negative log-likelihood of the conditional probability of the ground truth:

O = -\sum_{(I_i, l_i) ∈ X} \log p(l_i | y_i)        (3)

where y_i is the sequence produced by the recurrent and convolutional layers from I_i . This objective function calculates a cost value directly from an image and its ground truth label sequence. Therefore, the network can be end-to-end trained on pairs of images and sequences, eliminating the procedure of manually labeling all individual components in training images.

其中y_i是由I_i的循环层和卷积层产生的序列。 该目标函数直接从图像及其地面真相标签序列计算成本值。 因此,可以在成对的图像和序列上对网络进行端到端训练,从而省去了手动标记训练图像中所有单个组件的过程。


The network is trained with stochastic gradient descent (SGD). Gradients are calculated by the back-propagation algorithm. In particular, in the transcription layer, error differentials are back-propagated with the forward-backward algorithm, as described in [15]. In the recurrent layers, the Back-Propagation Through Time (BPTT) is applied to calculate the error differentials.

该网络使用随机梯度下降(SGD)进行训练。 梯度是通过反向传播算法计算的。 特别是,在转录层中,误差差异通过前向后算法向后传播,如[15]所述。 在循环层中,应用反向传播时间(BPTT)来计算误差差异。


For optimization, we use the ADADELTA [37] to automatically calculate per-dimension learning rates. Compared with the conventional momentum [31] method, ADADELTA requires no manual setting of a learning rate. More importantly, we find that optimization using ADADELTA converges faster than the momentum method.

为了优化,我们使用ADADELTA [37]自动计算每维度的学习率。 与传统的动量[31]方法相比,ADADELTA不需要手动设置学习速率。 更重要的是,我们发现使用ADADELTA进行优化的收敛速度快于动量法。
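
Putting the pieces together, one training step under the objective of Eq. 3 might look like the following sketch, where `crnn` is a hypothetical module composed of the convolutional and recurrent layers described above, index 0 of L' is reserved for 'blank', and torch.nn.CTCLoss provides the forward-backward computation of the CTC objective (the paper's own implementation is in Torch7, not PyTorch):

```python
import torch
import torch.nn as nn

def make_optimizer(crnn):
    # ADADELTA with rho = 0.9, as described above; no manually tuned learning rate.
    return torch.optim.Adadelta(crnn.parameters(), rho=0.9)

def train_step(crnn, optimizer, images, targets, target_lengths):
    """One SGD step on the CTC objective (Eq. 3) for a batch of (image, label
    sequence) pairs. `crnn` is assumed to output per-frame scores of shape
    (T, batch, |L'|); `targets` is the padded (batch, S) tensor of label indices."""
    ctc = nn.CTCLoss(blank=0, zero_infinity=True)
    log_probs = crnn(images).log_softmax(2)                      # (T, N, |L'|)
    T, N, _ = log_probs.shape
    input_lengths = torch.full((N,), T, dtype=torch.long)
    loss = ctc(log_probs, targets, input_lengths, target_lengths)  # -log p(l|y)
    optimizer.zero_grad()
    loss.backward()   # forward-backward through the CTC layer, BPTT through the LSTMs
    optimizer.step()
    return loss.item()
```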


3. Experiments

To evaluate the effectiveness of the proposed CRNN model, we conducted experiments on standard benchmarks for scene text recognition and musical score recognition, which are both challenging vision tasks. The datasets and setting for training and testing are given in Sec.3.1, the detailed settings of CRNN for scene text images is provided in Sec.3.2, and the results with the comprehensive comparisons are reported in Sec.3.3. To further demonstrate the generality of CRNN, we verify the proposed algorithm on a music score recognition task in Sec.3.4.

为了评估所提出的CRNN模型的有效性,我们针对场景文本识别和乐谱识别的标准基准进行了实验,这两者都是具有挑战性的视觉任务。 训练和测试的数据集和设置在第3.1节中给出,场景文本图像的CRNN的详细设置在第3.2节中提供,经过全面比较的结果在第3.3节中进行了报告。 为了进一步证明CRNN的通用性,我们在第3.4节中对音乐分数识别任务验证了所提出的算法。


3.1. Datasets

For all the experiments on scene text recognition, we use the synthetic dataset (Synth) released by Jaderberg et al. [20] as the training data. The dataset contains 8 million training images and their corresponding ground truth words. Such images are generated by a synthetic text engine and are highly realistic. Our network is trained on the synthetic data once, and tested on all other real-world test datasets without any fine-tuning on their training data. Even though the CRNN model is purely trained with synthetic text data, it works well on real images from standard text recognition benchmarks.

对于所有场景文本识别实验,我们使用Jaderberg等人[20]发布的合成数据集(Synth)作为训练数据。该数据集包含800万张训练图像及其对应的真实单词标注。这些图像由合成文本引擎生成,具有很高的逼真度。我们的网络仅在该合成数据上训练一次,并在所有其他真实世界的测试数据集上进行测试,而无需在它们的训练数据上进行任何微调。即使CRNN模型完全由合成文本数据训练而成,它在标准文本识别基准的真实图像上也表现良好。


Four popular benchmarks for scene text recognition are used for performance evaluation, namely ICDAR 2003 (IC03), ICDAR 2013 (IC13), IIIT 5k-word (IIIT5k), and Street View Text (SVT).

四个流行的场景文本识别基准用于性能评估,即ICDAR 2003(IC03),ICDAR 2013(IC13),IIIT 5k字(IIIT5k)和街景文本(SVT)。


IC03 [27] test dataset contains 251 scene images with labeled text bounding boxes. Following Wang et al. [34], we ignore images that either contain non-alphanumeric characters or have fewer than three characters, and get a test set with 860 cropped text images. Each test image is associated with a 50-word lexicon defined by Wang et al. [34]. A full lexicon is built by combining all the per-image lexicons. In addition, we use a 50k-word lexicon consisting of the words in the Hunspell spell-checking dictionary [1].

IC03 [27]测试数据集包含251个带有标记文本边界框的场景图像。 继王等。 [34],我们将忽略包含非字母数字字符或少于三个字符的图像,并使用860个裁剪的文本图像获取测试集。 每个测试图像都与Wang等人定义的50个单词的词典相关。 [34]。 通过合并所有按图像的词典来构建完整的词典。 另外,我们使用由Hunspell拼写检查字典[1]中的单词组成的5万个单词词典。


Table 1. Network configuration summary. The first row is the top layer. 'k', 's' and 'p' stand for kernel size, stride and padding size respectively

表1.网络配置摘要。 第一行是顶层。 " k"," s"和" p"分别代表内核大小,步幅和填充大小


IC13 [24] test dataset inherits most of its data from IC03. It contains 1,015 cropped word images with ground truth labels.

IIIT5k [28] contains 3,000 cropped word test images collected from the Internet. Each image has been associated with a 50-word lexicon and a 1k-word lexicon.

SVT [34] test dataset consists of 249 street view images collected from Google Street View. From them 647 word images are cropped. Each word image has a 50-word lexicon defined by Wang et al. [34].

IC13 [24]测试数据集继承了IC03的大部分数据。 它包含1,015个地面真相裁剪的单词图像。

IIIT5k [28]包含从互联网收集的3,000个裁剪的单词测试图像。 每个图像已与50个单词的词典和1000个单词的词典相关联。

SVT [34]测试数据集包含从Google街景收集的249幅街景图像。 从中裁剪出647个单词图像。 每个单词图像都有一个由Wang等人定义的50个单词的词典。[34]。


3.2. Implementation Details

The network configuration we use in our experiments is summarized in Table 1. The architecture of the convolutional layers is based on the VGG-VeryDeep architectures [32]. A tweak is made in order to make it suitable for recognizing English texts. In the 3rd and the 4th max-pooling layers, we adopt 1 × 2 sized rectangular pooling windows instead of the conventional square ones. This tweak yields feature maps with larger width, hence longer feature sequences. For example, an image containing 10 characters is typically of size 100×32, from which a feature sequence of 25 frames can be generated. This length exceeds the lengths of most English words. On top of that, the rectangular pooling windows yield rectangular receptive fields (illustrated in Fig. 2), which are beneficial for recognizing some characters that have narrow shapes, such as 'i' and 'l'.

表1总结了我们在实验中使用的网络配置。卷积层的体系结构基于VGG-VeryDeep体系结构[32]。 为了使它适合于识别英文文本,进行了一些调整。 在第3和第4个maxpooling层中,我们采用1×2大小的矩形池窗口,而不是常规的正方形池窗口。 这种调整会产生具有较大宽度的特征图,因此特征序列更长。 例如,包含10个字符的图像通常大小为100×32,可以从中生成25帧的特征序列。 该长度超过大多数英语单词的长度。 最重要的是,矩形合并窗口会产生矩形的接收场(如图2所示),这对于识别某些形状较窄的字符(例如" i"和" l")很有帮助。
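
A hedged sketch of such a convolutional backbone is given below; the channel counts follow the VGG-style description, but the exact pooling and final kernel shapes are assumptions chosen so that a 100×32 grayscale input produces a 25-frame feature sequence as stated above (they are not a verbatim copy of Table 1):

```python
import torch
import torch.nn as nn

def conv_block(cin, cout, bn=False):
    layers = [nn.Conv2d(cin, cout, 3, 1, 1), nn.ReLU(inplace=True)]
    if bn:
        layers.insert(1, nn.BatchNorm2d(cout))
    return layers

backbone = nn.Sequential(
    *conv_block(1, 64),   nn.MaxPool2d(2, 2),             # 32x100 -> 16x50
    *conv_block(64, 128), nn.MaxPool2d(2, 2),              # -> 8x25
    *conv_block(128, 256), *conv_block(256, 256),
    nn.MaxPool2d((2, 1), (2, 1)),                          # 1x2 rectangular pooling -> 4x25
    *conv_block(256, 512, bn=True), *conv_block(512, 512, bn=True),
    nn.MaxPool2d((2, 1), (2, 1)),                          # -> 2x25
    nn.Conv2d(512, 512, (2, 1), 1, 0), nn.ReLU(inplace=True),  # -> 1x25
)

x = torch.randn(1, 1, 32, 100)        # (batch, channel, height, width)
print(backbone(x).shape)              # torch.Size([1, 512, 1, 25]) -> 25 frames
```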


The network not only has deep convolutional layers, but also has recurrent layers. Both are known to be hard to train. We find that the batch normalization [19] technique is extremely useful for training network of such depth. Two batch normalization layers are inserted after the 5th and 6th convolutional layers respectively. With the batch normalization layers, the training process is greatly accelerated.

网络不仅具有深层的卷积层,而且具有循环层。 众所周知,两者都很难训练。 我们发现批量归一化[19]技术对于训练这种深度的网络非常有用。 在第五和第六卷积层之后分别插入两个批处理归一化层。 使用批处理归一化层,可以大大加快培训过程。


We implement the network within the Torch7 [10] framework, with custom implementations for the LSTM units (in Torch7/CUDA), the transcription layer (in C++) and the BK-tree data structure (in C++). Experiments are carried out on a workstation with a 2.50 GHz Intel(R) Xeon(R) E5-2609 CPU, 64GB RAM and an NVIDIA(R) Tesla(TM) K40 GPU. Networks are trained with ADADELTA, setting the parameter ρ to 0.9. During training, all images are scaled to 100 × 32 in order to accelerate the training process. The training process takes about 50 hours to reach convergence. Testing images are scaled to have height 32. Widths are proportionally scaled with heights, but at least 100 pixels. The average testing time is 0.16s/sample, as measured on IC03 without a lexicon. The approximate lexicon search is applied to the 50k lexicon of IC03, with the parameter δ set to 3. Testing each sample takes 0.53s on average.

我们在Torch7 [10]框架内实现网络,并为LSTM单元(在Torch7 / CUDA中),转录层(在C ++中)和BK树数据结构(在C ++中)自定义实现。 实验是在装有2.50 GHzIntel®Xeon®E5- 2609 CPU,64GB RAM和NVIDIA®Tesla®K40 GPU的工作站上进行的。 使用ADADELTA训练网络,将参数ρ设置为0.9。 在训练过程中,所有图像均按比例缩放为100×32,以加快训练过程。 培训过程大约需要50个小时才能达到收敛。 将测试图像缩放为高度32。宽度与高度成比例地缩放,但至少100像素。 在没有词典的IC03上测得的平均测试时间为0.16s /样品。 将近似词典搜索应用于IC03的50k词典,并将参数δ设置为3。测试每个样本平均需要0.53s。
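
For instance, the test-time scaling described above can be sketched as follows (using Pillow, an assumed dependency; the paper's own pipeline was implemented in Torch7):

```python
from PIL import Image

def preprocess_for_test(path, target_height=32, min_width=100):
    """Scale a test image to height 32; the width is scaled proportionally
    with the height but clamped to at least 100 pixels, as described above."""
    img = Image.open(path).convert('L')                  # grayscale
    w, h = img.size
    new_w = max(min_width, round(w * target_height / h))
    return img.resize((new_w, target_height), Image.BILINEAR)
```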


3.3. Comparative Evaluation

All the recognition accuracies on the above four public datasets, obtained by the proposed CRNN model and the recent state-of-the-art techniques including the approaches based on deep models [23, 22, 21], are shown in Table 2.

表2列出了通过建议的CRNN模型和最新技术(包括基于深度模型的方法)获得的上述四个公共数据集的所有识别准确性。


In the constrained lexicon cases, our method consistently outperforms most state-of-the-art approaches, and on average beats the best text reader proposed in [22]. Specifically, we obtain superior performance on IIIT5k and SVT compared to [22], and only achieve lower performance on IC03 with the "Full" lexicon. Note that the model in [22] is trained on a specific dictionary, namely each word is associated with a class label. Unlike [22], CRNN is not limited to recognizing words in a known dictionary, and is able to handle random strings (e.g. telephone numbers), sentences or other scripts such as Chinese words. Therefore, the results of CRNN are competitive on all the testing datasets.

在受限的词典情况下,我们的方法始终优于大多数最新技术,并且平均而言胜过[22]中提出的最佳文本阅读器。具体来说,与[22]相比,我们在IIIT5k和SVT上获得了更优的性能,仅在使用"完整"词典的IC03上性能较低。注意,[22]中的模型是在特定词典上训练的,即每个单词都与一个类别标签相关联。与[22]不同,CRNN不仅限于识别已知词典中的单词,还可以处理随机字符串(例如电话号码)、句子或其他文字(如中文单词)。因此,CRNN的结果在所有测试数据集上都具有竞争力。


In the unconstrained lexicon cases, our method achieves the best performance on SVT, yet is still behind some approaches [8, 22] on IC03 and IC13. Note that the blanks in the "None" columns of Table 2 denote that such approaches are unable to be applied to recognition without a lexicon or did not report recognition accuracies in the unconstrained cases. Our method uses only synthetic text with word-level labels as the training data, very different from PhotoOCR [8] which used 7.9 million real word images with character-level annotations for training. The best performance in the unconstrained lexicon cases is reported by [22], benefiting from its large dictionary; however, it is not a model strictly unconstrained to a lexicon as mentioned before. In this sense, our results in the unconstrained lexicon case are still promising.

在不受约束的词典情况下,我们的方法在SVT上实现了最佳性能,但仍落后于IC03和IC13的某些方法[8,22]。 请注意,表2中"无"列中的空白表示在没有词汇的情况下,此类方法无法应用于识别,或者在无限制的情况下未报告识别准确性。 我们的方法仅使用带有单词级别标签的合成文本作为训练数据,这与PhotoOCR [8]完全不同,后者使用790万个带有字符级别注释的真实单词图像进行训练。 受益于其庞大的字典,[22]在不受约束的词典情况下报告了最佳性能,但是,它并不是如上所述严格不受词典约束的模型。 从这个意义上讲,我们在无约束词典情况下的结果仍然很有希望。


For further understanding the advantages of the proposed algorithm over other text recognition approaches, we provide a comprehensive comparison on several properties named E2E Train, Conv Ftrs, CharGT-Free, Unconstrained, and Model Size, as summarized in Table 3.

为了进一步了解该算法相对于其他文本识别方法的优势,我们对名为E2E Train,Conv Ftrs,CharGT-Free,Unconstrained和Model Size的几个属性进行了全面比较,如表3所示。



Table 3. Comparison among various methods. Attributes for comparison include: 1) being end-to-end trainable (E2E Train); 2) using convolutional features that are directly learned from images rather than using hand-crafted ones (Conv Ftrs); 3) requiring no ground truth bounding boxes for characters during training (CharGT-Free); 4) not confined to a pre-defined dictionary (Unconstrained); 5) the model size (if an end-to-end trainable model is used), measured by the number of model parameters (Model Size, M stands for millions).

表3.各种方法之间的比较。比较的属性包括:1)端到端可训练(E2E Train);2)使用直接从图像中学习的卷积特征,而不是手工特征(Conv Ftrs);3)在训练过程中不需要字符的真实边界框(CharGT-Free);4)不限于预定义的字典(Unconstrained);5)模型大小(如果使用了端到端可训练模型),由模型参数的数量衡量(Model Size,M代表百万)。


E2E Train: This column shows whether a certain text reading model is end-to-end trainable, without any preprocessing or several separate steps, which indicates such approaches are elegant and clean for training. As can be observed from Table 3, only the models based on deep neural networks, including [22, 21] as well as CRNN, have this property.

端到端培训:此列用于显示某种文本阅读模型是否可以进行端到端的培训,而无需任何预处理或通过几个单独的步骤,这表明此类方法对于培训而言是优雅而干净的。 从表3中可以看出,只有基于深度神经网络的模型(包括[22、21]和CRNN)才具有此属性。


Conv Ftrs: This column is to indicate whether an approach uses the convolutional features learned from training images directly or handcraft features as the basic representations.

Conv Ftrs:此列指示方法是直接使用从训练图像中学到的卷积特征还是手工特征作为基本表示。


CharGT-Free: This column is to indicate whether the character-level annotations are essential for training the model. As the input and output labels of CRNN can be a sequence, character-level annotations are not necessary.

CharGT-Free:此列用于指示字符级注释对于训练模型是否必不可少。 由于CRNN的输入和输出标签可以是一个序列,因此不需要字符级注释。


Unconstrained: This column indicates whether the trained model is constrained to a specific dictionary, unable to handle out-of-dictionary words or random sequences. Notice that though the recent models learned by label embedding [5, 14] and incremental learning [22] achieved highly competitive performance, they are constrained to a specific dictionary.

Unconstrained:此列用于指示训练后的模型是否仅限于特定词典,无法处理字典外单词或随机序列。请注意,尽管最近的模型是通过标签嵌入[5,14]和增量学习[ 22]取得了极好的竞争表现,它们被限制在特定的词典中。



Table 2. Recognition accuracies (%) on four datasets. In the second row, "50", "1k", "50k" and "Full" denote the lexicon used, and "None" denotes recognition without a lexicon. (*[22] is not lexicon-free in the strict sense, as its outputs are constrained to a 90k dictionary.)

表2.四个数据集的识别准确率(%)。 在第二行中," 50"," 1k"," 50k"和"完整"表示使用的词典,"无"表示不使用词典的识别。 (* [22]在严格意义上不是没有词典的,因为它的输出被限制在一个90k的字典中。


Model Size: This column reports the storage space of the learned model. In CRNN, all layers have weight-sharing connections, and fully-connected layers are not needed. Consequently, the number of parameters of CRNN is much less than that of the models learned on the variants of CNN [22, 21], resulting in a much smaller model compared with [22, 21]. Our model has 8.3 million parameters, taking only 33MB RAM (using a 4-byte single-precision float for each parameter), thus it can be easily ported to mobile devices.

模型大小:此列用于报告学习的模型的存储空间。 在CRNN中,所有层都具有权重共享连接,并且不需要完全连接的层。 因此,CRNN的参数数量远少于从CNN的变体中学习的模型[22,21],因此与[22,21]相比,模型要小得多。 我们的模型具有830万个参数,仅占用33MB RAM(每个参数使用4字节单精度浮点数),因此可以轻松地将其移植到移动设备上。
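
The quoted model size follows directly from this arithmetic:

```python
# 8.3 million parameters stored as 4-byte single-precision floats
print(8.3e6 * 4 / 1e6)   # ≈ 33.2 MB, matching the ~33MB figure above
```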

Table 3 clearly shows the differences among the different approaches in detail, and fully demonstrates the advantages of CRNN over other competing methods. In addition, to test the impact of the parameter δ, we experiment with different values of δ in Eq. 2. In Fig. 4 we plot the recognition accuracy as a function of δ. Larger δ results in more candidates, thus more accurate lexicon-based transcription. On the other hand, the computational cost grows with larger δ, due to longer BK-tree search time, as well as a larger number of candidate sequences for testing. In practice, we choose δ = 3 as a tradeoff between accuracy and speed.

表3清楚地详细显示了不同方法之间的差异,并充分证明了CRNN相对于其他竞争方法的优势。另外,为了测试参数δ的影响,我们在式(2)中试验了不同的δ值。在图4中,我们将识别精度绘制为δ的函数。δ越大,候选者越多,因此基于词典的转录更加准确。另一方面,由于BK树搜索时间变长,以及用于测试的候选序列数量增加,计算成本随着δ的增加而增长。实际上,我们选择δ=3作为精度和速度之间的折衷。


Figure 4. Blue line graph: recognition accuracy as a function of the parameter δ. Red bars: lexicon search time per sample. Tested on the IC03 dataset with the 50k lexicon.

图4.蓝线图:识别精度作为函数参数δ。 红条:每个样本的词典搜索时间。 使用50k词典在IC03数据集上进行了测试。


3.4. Musical Score Recognition

A musical score typically consists of sequences of musical notes arranged on staff lines. Recognizing musical scores in images is known as the Optical Music Recognition (OMR) problem. Previous methods often require image preprocessing (mostly binarization), staff line detection and individual note recognition [29]. We cast OMR as a sequence recognition problem, and predict a sequence of musical notes directly from the image with CRNN. For simplicity, we recognize pitches only, ignore all chords and assume the same major scale (C major) for all scores.

乐谱通常由排列在谱线上的音符序列组成。识别图像中的乐谱被称为光学音乐识别(OMR)问题。以前的方法通常需要图像预处理(主要是二值化)、谱线检测和单个音符识别[29]。我们将OMR视为序列识别问题,并使用CRNN直接从图像中预测音符序列。为简单起见,我们仅识别音高,忽略所有和弦,并为所有乐谱采用相同的大调音阶(C大调)。


To the best of our knowledge, there exists no public dataset for evaluating algorithms on pitch recognition. To prepare the training data needed by CRNN, we collect 2650 images from [2]. Each image contains a fragment of a score containing 3 to 20 notes. We manually label the ground truth label sequences (sequences of note pitches) for all the images. The collected images are augmented to 265k training samples by being rotated, scaled and corrupted with noise, and by replacing their backgrounds with natural images. For testing, we create three datasets: 1) "Clean", which contains 260 images collected from [2]. Examples are shown in Fig. 5.a; 2) "Synthesized", which is created from "Clean", using the augmentation strategy mentioned above. It contains 200 samples, some of which are shown in Fig. 5.b; 3) "Real-World", which contains 200 images of score fragments taken from music books with a phone camera. Examples are shown in Fig. 5.c.

据我们所知,目前尚无用于评估音高识别算法的公共数据集。为了准备CRNN所需的训练数据,我们从[2]中收集了2650张图像。每张图像包含一个乐谱片段,其中含有3至20个音符。我们为所有图像手动标注真实标签序列(音高序列)。通过旋转、缩放和噪声破坏,以及将背景替换为自然图像,收集到的图像被扩增为265k训练样本。为了进行测试,我们创建了三个数据集:1)"Clean",包含从[2]中收集的260张图像,示例如图5.a所示;2)"Synthesized",使用上述扩增策略从"Clean"创建,包含200个样本,其中一些如图5.b所示;3)"Real-World",包含200张用手机摄像头从乐谱书中拍摄的乐谱片段图像,示例如图5.c所示。



Figure 5. (a) Clean musical score images collected from [2]. (b) Synthesized musical score images. (c) Real-world score images taken with a mobile phone camera.

图5.(a)从[2]收集的干净的乐谱图像。(b)合成的乐谱图像。 (c)用手机相机拍摄的真实分数图像。


Since we have limited training data, we use a simplified CRNN configuration in order to reduce model capacity. Different from the configuration specified in Tab. 1, the 4th and 6th convolution layers are removed, and the 2-layer bidirectional LSTM is replaced by a 2-layer single directional LSTM. The network is trained on the pairs of images and corresponding label sequences. Two measures are used for evaluating the recognition performance: 1) fragment accuracy, i.e. the percentage of score fragments correctly recognized; 2) average edit distance, i.e. the average edit distance between predicted pitch sequences and the ground truths. For comparison, we evaluate two commercial OMR engines, namely the Capella Scan [3] and the PhotoScore [4].

由于训练数据有限,我们使用简化的CRNN配置以减少模型容量。与表1中指定的配置不同,删除了第4和第6卷积层,并将2层双向LSTM替换为2层单向LSTM。网络在图像与相应标签序列的配对上进行训练。我们使用两种指标来评估识别性能:1)片段准确率,即被正确识别的乐谱片段所占的百分比;2)平均编辑距离,即预测音高序列与真实序列之间的平均编辑距离。为了进行比较,我们评估了两种商用OMR引擎,即Capella Scan [3]和PhotoScore [4]。
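
The two evaluation measures can be sketched as follows (reusing the edit_distance function from the BK-tree sketch in Sec. 2.3.3; the inputs are assumed to be lists of predicted and ground-truth pitch sequences):

```python
def fragment_accuracy(predictions, ground_truths):
    """Percentage of score fragments whose predicted pitch sequence matches exactly."""
    correct = sum(p == g for p, g in zip(predictions, ground_truths))
    return 100.0 * correct / len(ground_truths)

def average_edit_distance(predictions, ground_truths):
    """Mean Levenshtein distance between predicted and ground-truth pitch sequences."""
    total = sum(edit_distance(p, g) for p, g in zip(predictions, ground_truths))
    return total / len(ground_truths)
```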



Table 4. Comparison of pitch recognition accuracies, among CRNN and two commercial OMR systems, on the three datasets we have collected. Performances are evaluated by fragment accuracies and average edit distance ("fragment accuracy/average edit distance").

表4.在我们收集的三个数据集上,CRNN和两个商用OMR系统之间的音高识别准确率比较。性能通过片段准确率和平均编辑距离("片段准确率/平均编辑距离")来评估。


Tab. 4 summarizes the results. The CRNN outperforms the two commercial systems by a large margin. The Capella Scan and PhotoScore systems perform reasonably well on the Clean dataset, but their performances drop significantly on synthesized and real-world data. The main reason is that they rely on robust binarization to detect staff lines and notes, but the binarization step often fails on synthesized and real-world data due to bad lighting conditions, noise corruption and cluttered backgrounds. The CRNN, on the other hand, uses convolutional features that are highly robust to noise and distortions. Besides, the recurrent layers in CRNN can utilize contextual information in the score. Each note is recognized not only by itself, but also by the nearby notes. Consequently, some notes can be recognized by comparing them with the nearby notes, e.g. contrasting their vertical positions.

表4总结了结果。CRNN大大优于两个商用系统。Capella Scan和PhotoScore系统在Clean数据集上的表现相当不错,但在合成和真实数据上的性能却大大下降。主要原因是它们依靠稳健的二值化来检测谱线和音符,但是由于光照条件差、噪声破坏和背景杂乱,二值化步骤在合成和真实数据上经常失败。另一方面,CRNN使用对噪声和失真具有高度鲁棒性的卷积特征。此外,CRNN中的循环层可以利用乐谱中的上下文信息。每个音符不仅通过其自身被识别,还通过附近的音符被识别。因此,可以通过与附近音符进行比较(例如对比它们的垂直位置)来识别某些音符。


The results have shown the generality of CRNN, in that it can be readily applied to other image-based sequence recognition problems, requiring minimal domain knowledge. Compared with Capella Scan and PhotoScore, our CRNN-based system is still preliminary and misses many functionalities. But it provides a new scheme for OMR, and has shown promising capabilities in pitch recognition.

结果显示了CRNN的普遍性,因为它可以轻松应用于其他基于图像的序列识别问题,而所需的领域知识最少。 与Capella Scan和PhotoScore相比,我们基于CRNN的系统仍是初步的,缺少许多功能。 但是,它为OMR提供了一种新方案,并且在音高识别方面显示出了令人鼓舞的功能。


4. Conclusion

In this paper, we have presented a novel neural network architecture, called Convolutional Recurrent Neural Network (CRNN), which integrates the advantages of both Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN). CRNN is able to take input images of varying dimensions and produce predictions of different lengths. It directly runs on coarse-level labels (e.g. words), requiring no detailed annotations for each individual element (e.g. characters) in the training phase. Moreover, as CRNN abandons the fully connected layers used in conventional neural networks, it results in a much more compact and efficient model. All these properties make CRNN an excellent approach for image-based sequence recognition.

在本文中,我们提出了一种新颖的神经网络架构,称为卷积递归神经网络(CRNN),它融合了卷积神经网络(CNN)和递归神经网络(RNN)的优点。 CRNN能够拍摄不同尺寸的输入图像,并产生不同长度的预测。 它直接在粗糙级别的标签(例如单词)上运行,在训练阶段无需为每个单独的元素(例如字符)提供详细的注释。 此外,由于CRNN放弃了常规神经网络中使用的完全连接的层,因此它导致了更加紧凑和有效的模型。 所有这些特性使CRNN成为基于图像的序列识别的绝佳方法。


The experiments on the scene text recognition benchmarks demonstrate that CRNN achieves superior or highly competitive performance, compared with conventional methods as well as other CNN and RNN based algorithms. This confirms the advantages of the proposed algorithm. In addition, CRNN significantly outperforms other competitors on a benchmark for Optical Music Recognition (OMR), which verifies the generality of CRNN.

与传统方法以及其他基于CNN和RNN的算法相比,现场文本识别基准上的实验表明CRNN具有优异或极具竞争力的性能。 这证实了所提出算法的优点。 此外,CRNN在光学音乐识别(OMR)的基准上明显优于其他竞争对手,这证明了CRNN的普遍性。


Actually, CRNN is a general framework, thus it can be applied to other domains and problems (such as Chinese character recognition), which involve sequence prediction in images. To further speed up CRNN and make it more practical in real-world applications is another direction that is worthy of exploration in the future.

实际上,CRNN是一个通用框架,因此可以应用于涉及图像序列预测的其他领域和问题(例如汉字识别)。 进一步加快CRNN的速度,使其在实际应用中更加实用是另一个值得未来探索的方向。

原文: An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition (arXiv: 1507.05717)