第 13 章 Concurrency 并发编程
# 第 13 章 Concurrency 并发编程
by Brett L. Schuchert
“Objects are abstractions of processing. Threads are abstractions of schedule.”
—James O. Coplien1
“对象是过程的抽象。线程是调度的抽象。”
——James O
Writing clean concurrent programs is hard—very hard. It is much easier to write code that executes in a single thread. It is also easy to write multithreaded code that looks fine on the surface but is broken at a deeper level. Such code works fine until the system is placed under stress.
Coplien 编写整洁的并发程序很难——非常难。编写在单线程中执行的代码简单得多。编写表面上看来不错、深入进去却支离破碎的多线程代码也简单。系统一旦遭受压力,这种代码就扛不住了。
In this chapter we discuss the need for concurrent programming, and the difficulties it presents. We then present several recommendations for dealing with those difficulties, and writing clean concurrent code. Finally, we conclude with issues related to testing concurrent code.
本章将讨论并发编程的需求及其困难之处,并给出一些对付这些难点、编写整洁的并发代码的建议。最后,我们将讨论与测试并发代码有关的问题。
Clean Concurrency is a complex topic, worthy of a book by itself. Our strategy in this book is to present an overview here and provide a more detailed tutorial in “Concurrency II” on page 317. If you are just curious about concurrency, then this chapter will suffice for you now. If you have a need to understand concurrency at a deeper level, then you should read through the tutorial as well.
整洁的并发编程是个复杂话题,值得用一整本书来讨论。本书只做概览,并在“并发编程 II”一章中提供更详细的指引。如果你只是对并发好奇,阅读本章就足够了。如果你需要更深入地理解并发,就应读完整个指引章节。
# 13.1 WHY CONCURRENCY? 为什么要并发
Concurrency is a decoupling strategy. It helps us decouple what gets done from when it gets done. In single-threaded applications what and when are so strongly coupled that the state of the entire application can often be determined by looking at the stack backtrace. A programmer who debugs such a system can set a breakpoint, or a sequence of breakpoints, and know the state of the system by which breakpoints are hit.
并发是一种解耦策略。它帮助我们把做什么(目的)和何时(时机)做分解开。在单线程应用中,目的与时机紧密耦合,很多时候只要查看堆栈追踪即可断定应用程序的状态。调试这种系统的程序员可以设定断点或者断点序列,通过查看到达哪个断点来了解系统状态。
Decoupling what from when can dramatically improve both the throughput and structures of an application. From a structural point of view the application looks like many little collaborating computers rather than one big main loop. This can make the system easier to understand and offers some powerful ways to separate concerns.
解耦目的与时机能明显地改进应用程序的吞吐量和结构。从结构的角度来看,应用程序看起来更像是许多台协同工作的计算机,而不是一个大循环。系统因此会更易于被理解,给出了许多切分关注面的有力手段。
Consider, for example, the standard “Servlet” model of Web applications. These systems run under the umbrella of a Web or EJB container that partially manages concurrency for you. The servlets are executed asynchronously whenever Web requests come in. The servlet programmer does not have to manage all the incoming requests. In principle, each servlet execution lives in its own little world and is decoupled from all the other servlet executions.
例如,Web 应用的 Servlet 标准模式。这类系统运行于 Web 或 EJB 容器的保护伞之下,Web 或 EJB 为你部分地处理并发问题。当有 Web 请求时,servlet 就会异步执行。Servlet 程序员无需管理所有的请求。原则上,每次 servlet 是在自己的小世界中执行,与其他 servlet 的执行是分离的。
Of course if it were that easy, this chapter wouldn’t be necessary. In fact, the decoupling provided by Web containers is far less than perfect. Servlet programmers have to be very aware, and very careful, to make sure their concurrent programs are correct. Still, the structural benefits of the servlet model are significant.
当然,如果只是那么简单,也就没必要写这一章了。实际上,Web 容器提供的解耦手段离完美还差得远。Servlet 程序员得非常警惕、非常小心地保证并发程序不出错。同样,servlet 模式的结构性好处还是很明显。
But structure is not the only motive for adopting concurrency. Some systems have response time and throughput constraints that require hand-coded concurrent solutions. For example, consider a single-threaded information aggregator that acquires information from many different Web sites and merges that information into a daily summary. Because this system is single threaded, it hits each Web site in turn, always finishing one before starting the next. The daily run needs to execute in less than 24 hours. However, as more and more Web sites are added, the time grows until it takes more than 24 hours to gather all the data. The single-thread involves a lot of waiting at Web sockets for I/O to complete. We could improve the performance by using a multithreaded algorithm that hits more than one Web site at a time.
但结构并非采用并发的唯一动机。有些系统对响应时间和吞吐量有要求,需要手工编写并发解决方案。例如,考虑一个单线程信息聚合程序,它从许多 Web 站点获取信息,再合并写入日志中。因为该系统是单线程的,它会逐个访问 Web 站点,在开始下一个之前等待当前站点访问完毕。每天的执行时间必须少于 24 个小时。然而,随着要访问的站点越来越多,采集所有数据花费的时间也越来越多,最终超过了 24 个小时的限制。单线程程序许多时间花在等待 Web 套接字 I/O 结束上面。通过采用同时访问多个站点的多线程算法,就能改进性能。
Or consider a system that handles one user at a time and requires only one second of time per user. This system is fairly responsive for a few users, but as the number of users increases, the system’s response time increases. No user wants to get in line behind 150 others! We could improve the response time of this system by handling many users concurrently.
或者,考虑某个每次花费 1 秒钟处理一个用户请求的系统。该系统在用户量较少的时候响应及时,但随着用户数增加,系统的响应时间也增加了。没人想排在 150 个人后面!通过并发处理多个用户请求,就能改进系统响应时间。
Or consider a system that interprets large data sets but can only give a complete solution after processing all of them. Perhaps each data set could be processed on a different computer, so that many data sets are being processed in parallel.
再或者,考虑某个解释大量数据集、但只在处理完全部数据后给出一个完整解决方案的系统。或许可以在独立的计算机上处理每个数据集,那样的话许多数据集就能并行地得到处理。
Myths and Misconceptions
迷思与误解
And so there are compelling reasons to adopt concurrency. However, as we said before, concurrency is hard. If you aren’t very careful, you can create some very nasty situations. Consider these common myths and misconceptions:
看来有足够的理由采用并发方案。然而,如前文所述,并发编程很难。如果你不那么细心,就会搞出不堪入目的东西来。看看以下常见的迷思和误解:
- Concurrency always improves performance. Concurrency can sometimes improve performance, but only when there is a lot of wait time that can be shared between multiple threads or multiple processors. Neither situation is trivial.
- Design does not change when writing concurrent programs. In fact, the design of a concurrent algorithm can be remarkably different from the design of a single-threaded system. The decoupling of what from when usually has a huge effect on the structure of the system.
- Understanding concurrency issues is not important when working with a container such as a Web or EJB container. In fact, you’d better know just what your container is doing and how to guard against the issues of concurrent update and deadlock described later in this chapter.
- (1)并发总能改进性能。并发有时能改进性能,但只在多个线程或处理器之间能分享大量等待时间的时候管用。事情没那么简单。
- (2)编写并发程序无需修改设计。事实上,并发算法的设计有可能与单线程系统的设计极不相同。目的与时机的解耦往往对系统结构产生巨大影响。
- (3)在采用 Web 或 EJB 容器的时候,理解并发问题并不重要。实际上,你最好了解容器在做什么,了解如何对付本章后文将提到的并发更新、死锁等问题。
Here are a few more balanced sound bites regarding writing concurrent software:
下面是一些有关编写并发软件的中肯说法:
- Concurrency incurs some overhead, both in performance as well as writing additional code.
- Correct concurrency is complex, even for simple problems.
- Concurrency bugs aren’t usually repeatable, so they are often ignored as one-offs2 instead of the true defects they are.
- Concurrency often requires a fundamental change in design strategy.
- 并发会在性能和编写额外代码上增加一些开销;
- 正确的并发是复杂的,即便对于简单的问题也是如此;
- 并发缺陷并非总能重现,所以常被看做偶发事件而忽略,未被当做真的缺陷看待;
- 并发常常需要对设计策略的根本性修改。
# 13.2 CHALLENGES 挑战
What makes concurrent programming so difficult? Consider the following trivial class:
并发编程为何如此之难?来看看下面这个小型类:
public class X {
private int lastIdUsed;
public int getNextId() {
return ++lastIdUsed;
}
}
2
3
4
5
6
7
Let’s say we create an instance of X, set the lastIdUsed field to 42, and then share the instance between two threads. Now suppose that both of those threads call the method getNextId(); there are three possible outcomes:
比如,创建 x 的一个实体,将 lastIdUsed 设置为 42,在两个线程中共享这个实体。假设这两个线程都调用 getNextId() 方法,结果可能有三种输出:
- Thread one gets the value 43, thread two gets the value 44, lastIdUsed is 44.
- Thread one gets the value 44, thread two gets the value 43, lastIdUsed is 44.
- Thread one gets the value 43, thread two gets the value 43, lastIdUsed is 43.
- 线程一得到值 43,线程二得到值 44,lastIdUsed 为 44;
- 线程一得到值 44,线程二得到值 43,lastIdUsed 为 44;
- 线程一得到值 43,线程二得到值 43,lastIdUsed 为 43。
The surprising third result3 occurs when the two threads step on each other. This happens because there are many possible paths that the two threads can take through that one line of Java code, and some of those paths generate incorrect results. How many different paths are there? To really answer that question, we need to understand what the Just-In-Time Compiler does with the generated byte-code, and understand what the Java memory model considers to be atomic.
第三种结果令人惊异,当两个线程相互影响时就会出现这种情况。这是因为线程在执行那行 Java 代码时有许多可能路径可行,有些路径会产生错误的结果。有多少种不同路径呢?要真正回答这个问题,需要理解 Just-In-Time 编译器如何对待生成的字节码,还要理解 Java 内存模型认为什么东西具有原子性。
A quick answer, working with just the generated byte-code, is that there are 12,870 different possible execution paths4 for those two threads executing within the getNextId method. If the type of lastIdUsed is changed from int to long, the number of possible paths increases to 2,704,156. Of course most of those paths generate valid results. The problem is that some of them don’t.
简答一下,就生成的字节码而言,对于在 getNextId 方法中执行的那两个线程,有 12870 种不同的可能执行路径。如果 lastIdUsed 的类型从 int 变为 long,则可能路径的数量将增至 2704156 种。当然,多数路径都得到正确结果。问题是其中一些不能得到正确结果。
# 13.3 CONCURRENCY DEFENSE PRINCIPLES 并发防御原则
What follows is a series of principles and techniques for defending your systems from the problems of concurrent code.
下面给出一系列防御并发代码问题的原则和技巧。
# 13.3.1 Single Responsibility Principle 单一权责原则
The SRP5 states that a given method/class/component should have a single reason to change. Concurrency design is complex enough to be a reason to change in it’s own right and therefore deserves to be separated from the rest of the code. Unfortunately, it is all too common for concurrency implementation details to be embedded directly into other production code. Here are a few things to consider:
单一权责原则(SRP)认为,方法/类/组件应当只有一个修改的理由。并发设计自身足够复杂到成为修改的理由,所以也该从其他代码中分离出来。不幸的是,并发实现细节常常直接嵌入到其他生产代码中。下面是要考虑的一些问题:
- Concurrency-related code has its own life cycle of development, change, and tuning.
- Concurrency-related code has its own challenges, which are different from and often more difficult than nonconcurrency-related code.
- The number of ways in which miswritten concurrency-based code can fail makes it challenging enough without the added burden of surrounding application code.
- 并发相关代码有自己的开发、修改和调优生命周期;
- 开发相关代码有自己要对付的挑战,和非并发相关代码不同,而且往往更为困难;
- 即便没有周边应用程序增加的负担,写得不好的并发代码可能的出错方式数量也已经足具挑战性。
Recommendation: Keep your concurrency-related code separate from other code.6
建议:分离并发相关代码与其他代码。
# 13.3.2 Corollary: Limit the Scope of Data 推论:限制数据作用域
As we saw, two threads modifying the same field of a shared object can interfere with each other, causing unexpected behavior. One solution is to use the synchronized keyword to protect a critical section in the code that uses the shared object. It is important to restrict the number of such critical sections. The more places shared data can get updated, the more likely:
如我们所见,两个线程修改共享对象的同一字段时,可能互相干扰,导致未预期的行为。解决方案之一是采用 synchronized 关键字在代码中保护一块使用共享对象的临界区(critical section)。限制临界区的数量很重要。更新共享数据的地方越多,就越可能:
- You will forget to protect one or more of those places—effectively breaking all code that modifies that shared data.
- There will be duplication of effort required to make sure everything is effectively guarded (violation of DRY7).
- It will be difficult to determine the source of failures, which are already hard enough to find.
- 你会忘记保护一个或多个临界区——破坏了修改共享数据的代码;
- 得多花力气保证一切都受到有效防护(破坏了 DRY 原则);
- 很难找到错误源,也很难判断错误源。
Recommendation: Take data encapsulation to heart; severely limit the access of any data that may be shared.
建议:谨记数据封装;严格限制对可能被共享的数据的访问。
# 13.3.3 Corollary: Use Copies of Data 推论:使用数据复本
A good way to avoid shared data is to avoid sharing the data in the first place. In some situations it is possible to copy objects and treat them as read-only. In other cases it might be possible to copy objects, collect results from multiple threads in these copies and then merge the results in a single thread.
避免共享数据的好方法之一就是一开始就避免共享数据。在某些情形下,有可能复制对象并以只读方式对待。在另外的情况下,有可能复制对象,从多个线程收集所有复本的结果,并在单个线程中合并这些结果。
If there is an easy way to avoid sharing objects, the resulting code will be far less likely to cause problems. You might be concerned about the cost of all the extra object creation. It is worth experimenting to find out if this is in fact a problem. However, if using copies of objects allows the code to avoid synchronizing, the savings in avoiding the intrinsic lock will likely make up for the additional creation and garbage collection overhead.
如果有避免共享数据的简易手段,结果代码就会大大减少导致错误的可能。你可能会关心创建额外对象的成本。值得试验一下看看那是否真是个问题。然而,假使使用对象复本能避免代码同步执行,则因避免了锁定而省下的价值有可能补偿得上额外的创建成本和垃圾收集开销。
# 13.3.4 Corollary: Threads Should Be as Independent as Possible 推论:线程应尽可能地独立
Consider writing your threaded code such that each thread exists in its own world, sharing no data with any other thread. Each thread processes one client request, with all of its required data coming from an unshared source and stored as local variables. This makes each of those threads behave as if it were the only thread in the world and there were no synchronization requirements.
让每个线程在自己的世界中存在,不与其他线程共享数据。每个线程处理一个客户端请求,从不共享的源头接纳所有请求数据,存储为本地变量。这样一来,每个线程都像是世界中的唯一线程,没有同步需要。
For example, classes that subclass from HttpServlet receive all of their information as parameters passed in to the doGet and doPost methods. This makes each Servlet act as if it has its own machine. So long as the code in the Servlet uses only local variables, there is no chance that the Servlet will cause synchronization problems. Of course, most applications using Servlets eventually run into shared resources such as database connections.
例如,HttpServlet 的子类接收所有以参数形式传递给 doGet 和 doPost 方法的信息。每个 Servlet 都像拥有独立虚拟机一般运行。只要 Servlet 中的代码只使用本地变量,Servlet 就不会导致同步问题。当然,多数使用 Servlet 的应用程序最终都还是会用到类似数据库连接之类的共享资源。
Recommendation: Attempt to partition data into independent subsets than can be operated on by independent threads, possibly in different processors.
建议:尝试将数据分解到可被独立线程(可能在不同处理器上)操作的独立子集。
# 13.4 KNOW YOUR LIBRARY 了解 Java 库
Java 5 offers many improvements for concurrent development over previous versions. There are several things to consider when writing threaded code in Java 5:
相对于之前的版本,Java 5 提供了许多并发开发方面的改进。在用 Java 5 编写线程代码时,要注意以下几点:
- Use the provided thread-safe collections.
- Use the executor framework for executing unrelated tasks.
- Use nonblocking solutions when possible.
- Several library classes are not thread safe.
- 使用类库提供的线程安全群集;
- 使用 executor 框架(executor framework)执行无关任务;
- 尽可能使用非锁定解决方案;
- 有几个类并不是线程安全的。
Thread-Safe Collections 线程安全群集 When Java was young, Doug Lea wrote the seminal book8 Concurrent Programming in Java. Along with the book he developed several thread-safe collections, which later became part of the JDK in the java.util.concurrent package. The collections in that package are safe for multithreaded situations and they perform well. In fact, the ConcurrentHashMap implementation performs better than HashMap in nearly all situations. It also allows for simultaneous concurrent reads and writes, and it has methods supporting common composite operations that are otherwise not thread safe. If Java 5 is the deployment environment, start with ConcurrentHashMap.
当 Java 还年轻时,Doug Lea 编写了 Concurrent Programming in Java(中译版《Java 并发编程》)教程,同时开发了几个线程安全群集,这些代码后来成为 JDK 中 java.util.concurrent 包的一部分。该代码包中的群集对于多线程解决方案是安全的,执行良好。实际上,在几乎所有情况下,ConcurrentHashMap 实现都比 HashMap 表现得好。它还支持同步并发读写,也拥有支持非线程安全的合成操作的方法。如果部署环境是 Java 5,可以采用 ConcurrentHashMap。
There are several other kinds of classes added to support advanced concurrency design. Here are a few examples:
还有几个支持高级并发设计的类。以下是其中一小部分,如表 13-1 所示。
Recommendation: Review the classes available to you. In the case of Java, become familiar with java.util.concurrent, java.util.concurrent.atomic, java.util.concurrent.locks.
建议:检读可用的类。对于 Java,掌握 java.util.concurrent、 java.util.concurrent.atomic 和 java.util.concurrent.locks。
# 13.5 KNOW YOUR EXECUTION MODELS 了解执行模型
There are several different ways to partition behavior in a concurrent application. To discuss them we need to understand some basic definitions.
有几种在并发应用中切分行为的途径。要讨论这些途径,我们需要理解一些基础定义,如表 13-2 所示。
Given these definitions, we can now discuss the various execution models used in concurrent programming.
有了这些定义,我们就能讨论在并发编程中用到的几种执行模型了。
# 13.5.1 Producer-Consumer 生产者-消费者模型
One or more producer threads create some work and place it in a buffer or queue. One or more consumer threads acquire that work from the queue and complete it. The queue between the producers and consumers is a bound resource. This means producers must wait for free space in the queue before writing and consumers must wait until there is something in the queue to consume. Coordination between the producers and consumers via the queue involves producers and consumers signaling each other. The producers write to the queue and signal that the queue is no longer empty. Consumers read from the queue and signal that the queue is no longer full. Both potentially wait to be notified when they can continue.
一个或多个生产者线程创建某些工作,并置于缓存或队列中。一个或多个消费者线程从队列中获取并完成这些工作。生产者和消费者之间的队列是一种限定资源。
# 13.5.2 Readers-Writers 读者-作者模型
When you have a shared resource that primarily serves as a source of information for readers, but which is occasionally updated by writers, throughput is an issue. Emphasizing throughput can cause starvation and the accumulation of stale information. Allowing updates can impact throughput. Coordinating readers so they do not read something a writer is updating and vice versa is a tough balancing act. Writers tend to block many readers for a long period of time, thus causing throughput issues.
当存在一个主要为读者线程提供信息源,但只偶尔被作者线程更新的共享资源,吞吐量就会是个问题。增加吞吐量,会导致线程饥饿和过时信息的累积。更新会影响吞吐量。协调读者线程,不去读作者线程正在更新的信息(反之亦然),这是一种辛苦的平衡工作。作者线程倾向于长期锁定许多读者线程,从而导致吞吐量问题。
The challenge is to balance the needs of both readers and writers to satisfy correct operation, provide reasonable throughput and avoiding starvation. A simple strategy makes writers wait until there are no readers before allowing the writer to perform an update. If there are continuous readers, however, the writers will be starved. On the other hand, if there are frequent writers and they are given priority, throughput will suffer. Finding that balance and avoiding concurrent update issues is what the problem addresses.
挑战之处在于平衡读者线程和作者线程的需求,实现正确操作,提供合理的吞吐量,避免线程饥饿。
# 13.5.3 Dining Philosophers 宴席哲学家
Imagine a number of philosophers sitting around a circular table. A fork is placed to the left of each philosopher. There is a big bowl of spaghetti in the center of the table. The philosophers spend their time thinking unless they get hungry. Once hungry, they pick up the forks on either side of them and eat. A philosopher cannot eat unless he is holding two forks. If the philosopher to his right or left is already using one of the forks he needs, he must wait until that philosopher finishes eating and puts the forks back down. Once a philosopher eats, he puts both his forks back down on the table and waits until he is hungry again.
想象一下,一群哲学家环坐在圆桌旁。每个哲学家的左手边放了一把叉子。桌面中央摆着一大碗意大利面。哲学家们思索良久,直至肚子饿了。每个人都要拿起叉子吃饭。但除非手上有两把叉子,否则就没法进食。如果左边或右边的哲学家已经取用一把叉子,中间这位就得等到别人吃完、放回叉子。每位哲学家吃完后,就将两把叉子放回桌面,直到肚子再饿。
Replace philosophers with threads and forks with resources and this problem is similar to many enterprise applications in which processes compete for resources. Unless carefully designed, systems that compete in this way can experience deadlock, livelock, throughput, and efficiency degradation.
用线程代替哲学家,用资源代替叉子,就变成了许多企业级应用中进程竞争资源的情形。如果没有用心设计,这种竞争式系统就会遭遇死锁、活锁、吞吐量和效率降低等问题。
Most concurrent problems you will likely encounter will be some variation of these three problems. Study these algorithms and write solutions using them on your own so that when you come across concurrent problems, you’ll be more prepared to solve the problem.
你可能遇到的并发问题,大多数都是这三个问题的变种。请研究并使用这些算法,这样,遇到并发问题时你就能有解决问题的准备了。
Recommendation: Learn these basic algorithms and understand their solutions.
建议:学习这些基础算法,理解其解决方案。
# 13.6 BEWARE DEPENDENCIES BETWEEN SYNCHRONIZED METHODS 警惕同步方法之间的依赖
Dependencies between synchronized methods cause subtle bugs in concurrent code. The Java language has the notion of synchronized, which protects an individual method. However, if there is more than one synchronized method on the same shared class, then your system may be written incorrectly.12
同步方法之间的依赖会导致并发代码中的狡猾缺陷。Java 语言有 synchronized 概念,可以用来保护单个方法。然而,如果在同一共享类中有多个同步方法,系统就可能写得不太正确了。
Recommendation: Avoid using more than one method on a shared object.
建议:避免使用一个共享对象的多个方法。
There will be times when you must use more than one method on a shared object. When this is the case, there are three ways to make the code correct:
有时必须使用一个共享对象的多个方法。在这种情况发生时,有 3 种写对代码的手段:
- Client-Based Locking—Have the client lock the server before calling the first method and make sure the lock’s extent includes code calling the last method.
- Server-Based Locking—Within the server create a method that locks the server, calls all the methods, and then unlocks. Have the client call the new method.
- Adapted Server—create an intermediary that performs the locking. This is an example of server-based locking, where the original server cannot be changed.
- 基于客户端的锁定——客户端代码在调用第一个方法前锁定服务端,确保锁的范围覆盖了调用最后一个方法的代码;
- 基于服务端的锁定——在服务端内创建锁定服务端的方法,调用所有方法,然后解锁。让客户端代码调用新方法;
- 适配服务端——创建执行锁定的中间层。这是一种基于服务端的锁定的例子,但不修改原始服务端代码。
# 13.7 KEEP SYNCHRONIZED SECTIONS SMALL 保持同步区域微小
The synchronized keyword introduces a lock. All sections of code guarded by the same lock are guaranteed to have only one thread executing through them at any given time. Locks are expensive because they create delays and add overhead. So we don’t want to litter our code with synchronized statements. On the other hand, critical sections13 must be guarded. So we want to design our code with as few critical sections as possible.
关键字 synchronized 制造了锁。同一个锁维护的所有代码区域在任一时刻保证只有一个线程执行。锁是昂贵的,因为它们带来了延迟和额外开销。所以我们不愿将代码扔给 synchronized 语句了事。另一方面,临界区应该被保护起来。所以,应该尽可能少地设计临界区。
Some naive programmers try to achieve this by making their critical sections very large. However, extending synchronization beyond the minimal critical section increases contention and degrades performance.
有些天真的程序员想通过扩大临界区面积达到这个目的。然而,将同步延展到最小临界区范围之外,会增加资源争用、降低执行效率。
Recommendation: Keep your synchronized sections as small as possible.
建议:尽可能减小同步区域。
# 13.8 WRITING CORRECT SHUT-DOWN CODE IS HARD 很难编写正确的关闭代码
Writing a system that is meant to stay live and run forever is different from writing something that works for awhile and then shuts down gracefully.
编写永远运行的系统,与编写运行一段时间后平静地关闭的系统是两码事。
Graceful shutdown can be hard to get correct. Common problems involve deadlock,15 with threads waiting for a signal to continue that never comes.
平静关闭很难做到。常见问题与死锁有关,线程一直等待永远不会到来的信号。
For example, imagine a system with a parent thread that spawns several child threads and then waits for them all to finish before it releases its resources and shuts down. What if one of the spawned threads is deadlocked? The parent will wait forever, and the system will never shut down.
例如,想象一个系统中有个父线程分裂出数个子线程,父线程等待所有子线程结束,然后释放资源并关闭。如果其中一个子线程发生死锁会怎样?父线程将一直等待下去,而系统就永远不能关闭。
Or consider a similar system that has been instructed to shut down. The parent tells all the spawned children to abandon their tasks and finish. But what if two of the children were operating as a producer/consumer pair. Suppose the producer receives the signal from the parent and quickly shuts down. The consumer might have been expecting a message from the producer and be blocked in a state where it cannot receive the shutdown signal. It could get stuck waiting for the producer and never finish, preventing the parent from finishing as well.
或者,考虑一个被指示关闭的类似系统。父线程告知全体子线程放弃任务并结束。如果其中两个子线程正以生产者/消费者模型操作会怎样呢?假设生产者线程从父线程处接收到信号,并迅速关闭。消费者线程可能还在等待生产者线程发来消息,于是就被锁定在无法接收到关闭信号的状态中。它会死等生产者线程,永不结束,从而导致父线程也无法结束。
Situations like this are not at all uncommon. So if you must write concurrent code that involves shutting down gracefully, expect to spend much of your time getting the shutdown to happen correctly.
这类情形并非那么不常见。如果你要编写涉及平静关闭的并发代码,请多预留一些时间搞对关闭过程。
Recommendation: Think about shut-down early and get it working early. It’s going to take longer than you expect. Review existing algorithms because this is probably harder than you think.
建议:尽早考虑关闭问题,尽早令其工作正常。这会花费比你预期更多的时间。检视既有算法,因为这可能会比想象中难得多。
# 13.9 TESTING THREADED CODE 测试线程代码
Proving that code is correct is impractical. Testing does not guarantee correctness. However, good testing can minimize risk. This is all true in a single-threaded solution. As soon as there are two or more threads using the same code and working with shared data, things get substantially more complex.
证明代码的正确性不切实际。测试并不能确保正确性。然而,好的测试却能尽量降低风险。这对于所有单线程解决方案都是对的。当有两个或多个线程使用同一代码段和共享数据,事情就变得非常复杂了。
Recommendation: Write tests that have the potential to expose problems and then run them frequently, with different programatic configurations and system configurations and load. If tests ever fail, track down the failure. Don’t ignore a failure just because the tests pass on a subsequent run.
建议:编写有潜力曝露问题的测试,在不同的编程配置、系统配置和负载条件下频繁运行。如果测试失败,跟踪错误。别因为后来测试通过了后来的运行就忽略失败。
That is a whole lot to take into consideration. Here are a few more fine-grained recommendations:
有一大堆问题要考虑。下面是一些精练的建议:
- Treat spurious failures as candidate threading issues.
- Get your nonthreaded code working first.
- Make your threaded code pluggable.
- Make your threaded code tunable.
- Run with more threads than processors.
- Run on different platforms.
- Instrument your code to try and force failures.
- 将伪失败看作可能的线程问题;
- 先使非线程代码可工作;
- 编写可插拔的线程代码;
- 编写可调整的线程代码;
- 运行多于处理器数量的线程;
- 在不同平台上运行;
- 调整代码并强迫错误发生。
# 13.9.1 Treat Spurious Failures as Candidate Threading Issues 将伪失败看作可能的线程问题
Threaded code causes things to fail that “simply cannot fail.” Most developers do not have an intuitive feel for how threading interacts with other code (authors included). Bugs in threaded code might exhibit their symptoms once in a thousand, or a million, executions. Attempts to repeat the systems can be frustratingly. This often leads developers to write off the failure as a cosmic ray, a hardware glitch, or some other kind of “one-off.” It is best to assume that one-offs do not exist. The longer these “one-offs” are ignored, the more code is built on top of a potentially faulty approach.
线程代码导致“不可能失败的”失败。多数开发者缺乏有关线程如何与其他代码(可能由其他作者编写)互动的直觉。线程代码中的缺陷可能在一千或一百万次执行中才会显现一次。重复执行想要复现问题令人沮丧。所以开发者常常会将失败归咎于宇宙射线、硬件错误或其他“偶发事件”。最好假设这种偶发事件根本不存在。“偶发事件”被忽略得越久,代码就越有可能搭建于不完善的基础之上。
Recommendation: Do not ignore system failures as one-offs.
建议:不要将系统错误归咎于偶发事件。
# 13.9.2 Get Your Nonthreaded Code Working First 先使非线程代码可工作
This may seem obvious, but it doesn’t hurt to reinforce it. Make sure code works outside of its use in threads. Generally, this means creating POJOs that are called by your threads. The POJOs are not thread aware, and can therefore be tested outside of the threaded environment. The more of your system you can place in such POJOs, the better.
这看起来太浅显,但强调一下不无益处。确保线程之外的代码可工作。通常,这意味着创建由线程调用的 POJO。POJO 与线程无涉,所以可在线程环境之外测试。能放进 POJO 中的代码越多越好。
Recommendation: Do not try to chase down nonthreading bugs and threading bugs at the same time. Make sure your code works outside of threads.
建议:不要同时追踪非线程缺陷和线程缺陷。确保代码在线程之外可工作。
# 13.9.3 Make Your Threaded Code Pluggable 编写可插拔的线程代码
Write the concurrency-supporting code such that it can be run in several configurations:
编写可在数个配置环境下运行的线程代码:
- One thread, several threads, varied as it executes
- Threaded code interacts with something that can be both real or a test double.
- Execute with test doubles that run quickly, slowly, variable.
- Configure tests so they can run for a number of iterations.
- 单线程与多个线程在执行时不同的情况;
- 线程代码与实物或测试替身互动;
- 用运行快速、缓慢和有变动的测试替身执行;
- 将测试配置为能运行一定数量的迭代。
Recommendation: Make your thread-based code especially pluggable so that you can run it in various configurations.
建议:编写可插拔的线程代码,这样就能在不同的配置环境下运行。
# 13.9.4 Make Your Threaded Code Tunable 编写可调整的线程代码
Getting the right balance of threads typically requires trial an error. Early on, find ways to time the performance of your system under different configurations. Allow the number of threads to be easily tuned. Consider allowing it to change while the system is running. Consider allowing self-tuning based on throughput and system utilization.
要获得良好的线程平衡,常常需要试错。一开始,在不同的配置环境下监测系统性能。要允许线程数量可调整。在系统运行时允许线程发生变动。允许线程依据吞吐量和系统使用率自我调整。
# 13.9.5 Run with More Threads Than Processors 运行多于处理器数量的线程
Things happen when the system switches between tasks. To encourage task swapping, run with more threads than processors or cores. The more frequently your tasks swap, the more likely you’ll encounter code that is missing a critical section or causes deadlock.
系统在切换任务时会发生一些事。为了促使任务交换的发生,运行多于处理器或处理器核心数量的线程。任务交换越频繁,越有可能找到错过临界区或导致死锁的代码。
# 13.9.6 Run on Different Platforms 在不同平台上运行
In the middle of 2007 we developed a course on concurrent programming. The course development ensued primarily under OS X. The class was presented using Windows XP running under a VM. Tests written to demonstrate failure conditions did not fail as frequently in an XP environment as they did running on OS X.
2007 年,我们做了一套关于并发编程的课程。该课程主要在 OS X 下开发,在运行于虚拟机的 Windows XP 上展示。用于演示的测试失败条件,在 OS X 上要比在 XP 上失败得更频繁。
In all cases the code under test was known to be incorrect. This just reinforced the fact that different operating systems have different threading policies, each of which impacts the code’s execution. Multithreaded code behaves differently in different environments.16 You should run your tests in every potential deployment environment.
被测试的代码已知是不正确的。这正强调了不同操作系统有着不同线程策略的事实,不同的线程策略影响了代码的执行。在不同环境中,多线程代码的行为也不一样。应该在所有可能部署的环境中运行测试。
Recommendation: Run your threaded code on all target platforms early and often.
建议:尽早并经常地在所有目标平台上运行线程代码。
# 13.9.7 Instrument Your Code to Try and Force Failures 装置试错代码
It is normal for flaws in concurrent code to hide. Simple tests often don’t expose them. Indeed, they often hide during normal processing. They might show up once every few hours, or days, or weeks!
并发代码中藏有缺陷,这并不罕见。简单的测试往往无法曝露这些缺陷。实际上,缺陷经常隐藏于一般处理过程中。可能好几个小时、好几天甚至好几个星期才会跳出来一次!
The reason that threading bugs can be infrequent, sporadic, and hard to repeat, is that only a very few pathways out of the many thousands of possible pathways through a vulnerable section actually fail. So the probability that a failing pathway is taken can be star-tlingly low. This makes detection and debugging very difficult.
线程中的缺陷之所以如此不频繁、偶发、难以重现,是因为在几千个穿过脆弱区域的可能路径当中,只有少数路径会真的导致失败。经过会导致失败的路径的可能性惊人地低。所以,侦测与调试也非常之难。
How might you increase your chances of catching such rare occurrences? You can instrument your code and force it to run in different orderings by adding calls to methods like Object.wait(), Object.sleep(), Object.yield() and Object.priority().
怎么才能增加捕捉住如此罕见之物的机会?可以装置代码,增加对 Object.wait( )、Object.sleep( )、Object.yield( )和 Object.priority( )等方法的调用,改变代码执行顺序。
Each of these methods can affect the order of execution, thereby increasing the odds of detecting a flaw. It’s better when broken code fails as early and as often as possible.
这些方法都会影响执行顺序,从而增加了侦测到缺陷的可能性。有问题的代码,最好尽早、尽可能多地通不过测试。
There are two options for code instrumentation:
有两种装置代码的方法:
- Hand-coded
- Automated
- 硬编码;
- 自动化。
# 13.9.8 Hand-Coded 硬编码
You can insert calls to wait(), sleep(), yield(), and priority() in your code by hand. It might be just the thing to do when you’re testing a particularly thorny piece of code.
你可以手工向代码中插入 wait()、sleep()、yield() 和 priority() 的调用。在测试某段棘手的代码时,正当如此操作。
Here is an example of doing just that:
下面是个例子:
public synchronized String nextUrlOrNull() {
if(hasNext()) {
String url = urlGenerator.next();
Thread.yield(); // inserted for testing.
updateHasNext();
return url;
}
return null;
}
2
3
4
5
6
7
8
9
The inserted call to yield() will change the execution pathways taken by the code and possibly cause the code to fail where it did not fail before. If the code does break, it was not because you added a call to yield().17 Rather, your code was broken and this simply made the failure evident.
插入对 yield() 的调用,将改变代码的执行路径,由此而可能导致代码在以前未失败过的地方失败。如果代码的确出错,那并非是因为你插入了 yield() 方法调用。代码出错了,这便是失败的原因。
There are many problems with this approach:
这种手法有许多毛病:
- You have to manually find appropriate places to do this.
- How do you know where to put the call and what kind of call to use?
- Leaving such code in a production environment unnecessarily slows the code down.
- It’s a shotgun approach. You may or may not find flaws. Indeed, the odds aren’t with you.
- 你得手工找到合适的地方来插入方法调用;你怎么知道在哪里插入调用、插入什么调用?
- 不必要地在产品环境中留下这类代码,将拖慢代码执行速度;
- 这是种无的放矢的手段。你可能找不到缺陷。实际上,这不在你把握之中。
What we need is a way to do this during testing but not in production. We also need to easily mix up configurations between different runs, which results in increased chances of finding errors in the aggregate.
我们所需要的,是一种在测试中但不在生产中实现的手段。我们还需要为多次运行轻易地调整配置,从而增加总的发现错误机会。
Clearly, if we divide our system up into POJOs that know nothing of threading and classes that control the threading, it will be easier to find appropriate places to instrument the code. Moreover, we could create many different test jigs that invoke the POJOs under different regimes of calls to sleep, yield, and so on.
无疑,如果将系统分解为对线程及控制线程的类一无所知的 POJO,就能更容易地找到装置代码的位置。而且,还能创建许多个以不同方式调用 sleep、yield 等方法的 POJO 测试。
# 13.9.9 Automated 自动化
You could use tools like an Aspect-Oriented Framework, CGLIB, or ASM to programmatically instrument your code. For example, you could use a class with a single method:
可以使用 Aspect-Oriented Framework、CGLIB 或 ASM 之类工具通过编程来装置代码。例如,可以使用有单个方法的类:
public class ThreadJigglePoint {
public static void jiggle() {
}
}
2
3
4
You can add calls to this in various places within your code:
可以在代码的不同位置调用这个方法:
public synchronized String nextUrlOrNull() {
if(hasNext()) {
ThreadJiglePoint.jiggle();
String url = urlGenerator.next();
ThreadJiglePoint.jiggle();
updateHasNext();
ThreadJiglePoint.jiggle();
return url;
}
return null;
}
2
3
4
5
6
7
8
9
10
11
Now you use a simple aspect that randomly selects among doing nothing, sleeping, or yielding.
如此,你就得到了一个随机选择无所作为、睡眠或让步的方面。
Or imagine that the ThreadJigglePoint class has two implementations. The first implements jiggle to do nothing and is used in production. The second generates a random number to choose between sleeping, yielding, or just falling through. If you run your tests a thousand times with random jiggling, you may root out some flaws. If the tests pass, at least you can say you’ve done due diligence. Though a bit simplistic, this could be a reasonable option in lieu of a more sophisticated tool.
或者,想象 ThreadJigglePoint 类有两种实现。第一种实现 jiggle 什么都不做,在生产环境中使用。第二种实现生成一个随机数,在睡眠、让步或径直执行间做选择。如果上千次地做这种随机测试,大概就能找到一些缺陷的根源。假如测试都通过了,至少你可以说自己已谨慎对待。这种方法看似有点过于简单,但确是替代复杂工具的一种可选方案。
There is a tool called ConTest,18 developed by IBM that does something similar, but it does so with quite a bit more sophistication.
有一种叫做 ConTest 的工具,由 IBM 开发,能做类似的事情,但做法却稍微复杂些。
The point is to jiggle the code so that threads run in different orderings at different times. The combination of well-written tests and jiggling can dramatically increase the chance finding errors.
要点是让代码“异动”,从而使线程以不同次序执行。编写良好的测试与“异动”相组合,能有效地增加发现错误的机会。
Recommendation: Use jiggling strategies to ferret out errors.
建议:使用异动策略搜出错误。
# 13.10 CONCLUSION 小结
Concurrent code is difficult to get right. Code that is simple to follow can become nightmarish when multiple threads and shared data get into the mix. If you are faced with writing concurrent code, you need to write clean code with rigor or else face subtle and infrequent failures.
并发代码很难写正确。加入多线程和共享数据后,简单的代码也会变成噩梦。要编写并发代码,就得严格地编写整洁的代码,否则将面临微细和不频繁发生的失败。
First and foremost, follow the Single Responsibility Principle. Break your system into POJOs that separate thread-aware code from thread-ignorant code. Make sure when you are testing your thread-aware code, you are only testing it and nothing else. This suggests that your thread-aware code should be small and focused.
第一要诀是遵循单一权责原则。将系统切分为分离了线程相关代码和线程无关代码的 POJO。确保在测试线程相关代码时只是在测试,没有做其他事情。线程相关代码应该保持短小和目的集中。
Know the possible sources of concurrency issues: multiple threads operating on shared data, or using a common resource pool. Boundary cases, such as shutting down cleanly or finishing the iteration of a loop, can be especially thorny.
了解并发问题的可能原因:对共享数据的多线程操作,或使用了公共资源池。类似平静关闭或停止循环之类边界情况尤其棘手。
Learn your library and know the fundamental algorithms. Understand how some of the features offered by the library support solving problems similar to the fundamental algorithms.
学习类库,了解基本算法。理解类库提供的与基础算法类似的解决问题的特性。
Learn how to find regions of code that must be locked and lock them. Do not lock regions of code that do not need to be locked. Avoid calling one locked section from another. This requires a deep understanding of whether something is or is not shared. Keep the amount of shared objects and the scope of the sharing as narrow as possible. Change designs of the objects with shared data to accommodate clients rather than forcing clients to manage shared state.
学习如何找到必须锁定的代码区域并锁定之。不要锁定不必锁定的代码。避免从锁定区域中调用其他锁定区域。这需要深刻理解某物是否已共享。尽可能减少共享对象和共享范围。修改对象的设计,向客户代码提供共享数据,而不是迫使客户代码管理共享状态。
Issues will crop up. The ones that do not crop up early are often written off as a onetime occurrence. These so-called one-offs typically only happen under load or at seemingly random times. Therefore, you need to be able to run your thread-related code in many configurations on many platforms repeatedly and continuously. Testability, which comes naturally from following the Three Laws of TDD, implies some level of plug-ability, which offers the support necessary to run code in a wider range of configurations.
问题会跳出来。那种在早期没跳出来的问题往往是偶发的。这种所谓偶发问题,通常仅在高负载下出现或者偶然出现。所以,你要能在不同平台上、以不同配置持续重复运行线程代码。跟随 TDD 三要则而来的可测试性意味着某种程度的可插拔性,从而提供了在大量不同配置下运行代码的必要支持。
You will greatly improve your chances of finding erroneous code if you take the time to instrument your code. You can either do so by hand or using some kind of automated technology. Invest in this early. You want to be running your thread-based code as long as possible before you put it into production.
如果花点时间装置代码,就能极大地提升发现错误代码的机会。可以手工做,也可以使用某种自动化技术。尽早这么做。在将线程代码投入生产环境前,就要尽可能多地运行它。
If you take a clean approach, your chances of getting it right increase drastically.
只要采用了整洁的做法,做对的可能性就有翻天覆地的提高。